Date: Sun, 27 Oct 2024 00:07:14 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
To: David Hildenbrand
cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
    "Kirill A. Shutemov", Matthew Wilcox, Johannes Weiner, Baolin Wang,
    Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
Message-ID: <3f4fc787-c1d4-4d29-355e-d5ae0737e55c@google.com>
References: <760237a3-69d6-9197-432d-0306d52c048a@google.com>
    <7dc6b280-cd87-acd1-1124-e512e3d2217d@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Thu, 24 Oct 2024, David Hildenbrand wrote:
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without). The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.
>
> Yes please. I stumbled into that myself recently -- leftover from
> previous rework.
>
> It would have been reasonable to move that into a separate (follow-up?)
> patch.

I was expecting someone to ask for a breakup: I plead, as the one who
will try some backports, to keep this all in one to make that job easier.

I doubt I'll go back further than 6.6 LTS. Porting this patch is easy,
the hard part is gathering together the little bits and pieces it also
needs to be good (I'm thinking of folio migration freezing anon to 0
in particular, that will lead to others).

> > The 87eaceb3faa5 commit did have the code to move from one deferred
> > list to another (but was not conscious of its unsafety while refcount
> > non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp:
> > don't need care deferred split queue in memcg charge move path"),
> > which argued that the existence of a PMD mapping guarantees that the
> > THP cannot be on a deferred list.
>
> I recall this can happen, not sure on 5.6 though: assume we have an
> anon THP with 1 PMD mapping and a single PTE mapping for simplicity.
>
> Assume we want to migrate that folio and first remove the PMD mapping,
> then the PTE mapping. After removing the PMD mapping, we add it to the
> deferred split queue (single PTE mapped).
>
> Now assume migration fails and we remove migration entries -> remap.
>
> We now have a PMD-mapped THP on the deferred split queue.
>
> (again, I might be wrong but that's from memory without digging into
> the code)

I believe you're right, thank you for doing the heavy thinking for me.
Then I came up with 5.8's 5503fbf2b0b8 ("khugepaged: allow to collapse
PTE-mapped compound pages") introducing another way.

It looks to me like there is a case for backporting this patch, but not
a very strong case: worth some effort, but not too much - I've not heard
of the predicted issues in practice.

> > I'm not sure if that was true at the time (swapcache remapped?), but
> > it's clearly not true since 6.12 commit dafff3f4c850 ("mm: split
> > underused THPs").
>
> We only remap PTEs from the swapcache, never PMDs.

Right, thanks.
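To spell out the failed-migration sequence David describes above, here
is an illustrative sketch - not code from the patch; the function names
are the mainline ones circa 6.12, with locking and error handling elided:

/*
 * Illustration only: how a PMD-mapped THP can end up on the
 * deferred split queue, per the sequence described above.
 */
static void failed_migration_sketch(struct folio *folio)
{
	/* Unmap for migration: the PMD mapping goes, then the PTE. */
	try_to_migrate(folio, 0);

	/*
	 * In between, rmap saw the folio partially mapped (PTE but
	 * no PMD) and called deferred_split_folio() to queue it.
	 */

	/* Migration fails: remove migration entries, restoring mappings. */
	remove_migration_ptes(folio, folio, 0);

	/*
	 * The THP is PMD-mapped again, but still on the deferred
	 * split queue - contradicting fac0516b5534's assumption that
	 * a PMD mapping keeps it off the queue.
	 */
}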
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >  	if (!partially_mapped && !split_underused_thp)
> >  		return;
> >
> > -	/*
> > -	 * The try_to_unmap() in page reclaim path might reach here too,
> > -	 * this may cause a race condition to corrupt deferred split queue.
> > -	 * And, if page reclaim is already handling the same folio, it is
> > -	 * unnecessary to handle it again in shrinker.
> > -	 *
> > -	 * Check the swapcache flag to determine if the folio is being
> > -	 * handled by page reclaim since THP swap would add the folio into
> > -	 * swap cache before calling try_to_unmap().
> > -	 */
> > -	if (folio_test_swapcache(folio))
> > -		return;
> > -
> >  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >  	if (partially_mapped) {
> >  		if (!folio_test_partially_mapped(folio)) {

I'm changing this part in v2: keeping the return on swapcache, updating
the comment to say it's no longer essential, but a likely optimization
(a sketch of that direction follows at the end of this mail).

I noticed the stats for deferred split were very different when I made
this change, largely because I'm doing a lot of unrealistic swapping no
doubt, but I don't want to take the risk of changing behaviour when all
we're trying to fix is list corruption.

> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> >  #endif
> >  }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >  {
> >  	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
>
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...

Yes, I believe that is what it's about. Somewhere else excludes hugetlb
explicitly instead. I know Matthew would have liked to find a better name
than "large_rmappable" (aren't hugetlb folios large and rmappable?), but
nobody suggested a better one.

You've reminded me to worry about how hugetlb pages get through
free_tail_page_prepare()'s case 2 check for list_empty(_deferred_list).
Ah, there's an INIT_LIST_HEAD(&folio->_deferred_list) in mm/hugetlb.c,
before freeing a hugetlb folio. (The kind of detail which may be
different in backports.)

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >  		unsigned long pfn = folio_pfn(folio);
> >  		unsigned int order = folio_order(folio);
> >
> > -		folio_undo_large_rmappable(folio);
> > +		if (mem_cgroup_disabled())
> > +			folio_unqueue_deferred_split(folio);
> > +
>
> Is it worth adding a comment here where we take care of this with memcg
> enabled?

I've confessed to Yang Shi that I got this wrong: so in v2 there is not
even a line to add such a comment on (but a comment in the commit message).

> It's complicated stuff, nothing jumped at me, but it's late here so my
> brain is not fully functional ...

Even your not fully functional brain is so much quicker than mine:
many thanks for applying it.

Hugh
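
P.S. The v2 direction described above for deferred_split_folio(),
sketched as the hunk might then read - an illustration of the stated
intent (keep the early return, soften the comment), not the actual v2:

	if (!partially_mapped && !split_underused_thp)
		return;

	/*
	 * Exclude swapcache: no longer essential for correctness once
	 * unqueueing is done under the split_queue_lock, but a likely
	 * optimization - page reclaim is already handling this folio,
	 * so there is no need for the shrinker to handle it again.
	 */
	if (folio_test_swapcache(folio))
		return;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);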
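
And for anyone reading along without the patch to hand, a minimal sketch
of the shape __folio_unqueue_deferred_split() might take, inferred from
the mm/internal.h declarations quoted above - the locking mirrors
deferred_split_folio(), and the partially_mapped bookkeeping is an
assumption rather than the authoritative hunk:

/*
 * Minimal sketch, not the patch itself: unqueue the folio from its
 * deferred split queue, under the split_queue_lock that deferred
 * split queues are (to be) consistently protected by.
 */
bool __folio_unqueue_deferred_split(struct folio *folio)
{
	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
	unsigned long flags;
	bool unqueued = false;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		if (folio_test_partially_mapped(folio))
			folio_clear_partially_mapped(folio);
		list_del_init(&folio->_deferred_list);
		unqueued = true;
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	return unqueued;	/* useful for debug warnings */
}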