From: Yang Shi <shy828301@gmail.com>
Date: Fri, 25 Oct 2024 11:36:15 -0700
Subject: Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
To: Zi Yan
Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
 "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand,
 Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang, Ryan Roberts,
 Nhat Pham, Chris Li, linux-kernel@vger.kernel.org, linux-mm@kvack.org
On Fri, Oct 25, 2024 at 8:32 AM Zi Yan wrote:
>
> On 25 Oct 2024, at 1:41, Hugh Dickins wrote:
>
> > On Thu, 24 Oct 2024, Zi Yan wrote:
> >> On 24 Oct 2024, at 0:10, Hugh Dickins wrote:
> >>
> >>> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> >>> I gave bad advice, it looks plausible since that's a local on-stack
> >>> list, but the fact is that it can race with a third party freeing or
> >>> migrating the preceding folio (properly unqueueing it with refcount 0
> >>> while holding split_queue_lock), thereby corrupting the list linkage.
> >>>
> >>> The obvious answer would be to take split_queue_lock there: but it has
> >>> a long history of contention, so I'm reluctant to add to that. Instead,
> >>> make sure that there is always one safe (raised refcount) folio before,
> >>> by delaying its folio_put(). (And of course I was wrong to suggest
> >>> updating split_queue_len without the lock: leave that until the splice.)
> >>
> >> I feel like this is not the right approach, since it breaks the existing
> >> condition for changing folio->_deferred_list, namely taking
> >> ds_queue->split_queue_lock for serialization. The contention might not be
> >> as high as you think, since if a folio were split, the split_queue_lock
> >> needed to be taken during the split anyway. So the worst case is the same
> >> as if all folios were split. Do you see significant perf degradation due to
> >> taking the lock when doing list_del_init()?
> >>
> >> I am afraid that if we take this route, we might hit hard-to-debug bugs
> >> in the future when someone touches the code.
> >
> > You have a good point: I am adding another element of trickiness
> > to that already-tricky local-but-not-quite list - which has tripped
> > us up a few times in the past.
> >
> > I do still feel that this solution is right in the spirit of that list;
> > but I've certainly not done any performance measurement to justify it,
> > nor would I ever trust my skill to do so. I just tried to solve the
> > corruptions in what I thought was the best way.
> >
> > (To be honest, I found this solution to the corruptions first, and thought
> > the bug went back to the original implementation: that its put_page() at
> > the end of the loop was premature all along. It was only when writing the
> > commit message two days ago that I came to realize that even put_page()
> > or folio_put() would be safely using the lock to unqueue: that it is only
> > this new list_del_init() which is the exception that introduces the bug.)
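
(Restating the race in code, just to make sure I read it right - a
condensed, hypothetical sketch rather than the actual tree. The
third-party unqueue at refcount 0 always takes the lock, while the new
removal from the local list does not:

	/* third party, e.g. folio_put() hitting refcount 0: unqueues the
	 * preceding folio under the lock, rewriting its neighbours' links */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* deferred_split_scan(): unlocked removal from the local list; if
	 * the locked removal above is rewriting the preceding folio's links
	 * at the same time, the linkage seen here is corrupted */
	list_del_init(&folio->_deferred_list);

so delaying the folio_put() keeps the preceding folio's refcount raised
and keeps the third party away from it.)
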
> >
> > Looking at vmstats, I'm coming to believe that the performance advantage
> > of this way is likely to be in the noise: that mTHPs alone, and the
> > !partially_mapped case on top, are greatly increasing the split_deferred
> > stats: and may give rise to renewed complaints of lock contention, with
> > or without this optimization.
> >
> > While I still prefer to stick with what's posted and most tested, I am
> > giving the locked version a run overnight. Thanks a lot for the reviews
> > and acks everyone: at present Zi Yan is in the minority preferring a
> > locked version, but please feel free to change your vote if you wish.
>
> Thank you a lot for taking the time to check the locked version. Looking
> forward to the result. BTW, I am not going to block this patch, since it
> fixes the bug.
>
> The tricky part in deferred_list_scan() is always the use of
> folio->_deferred_list without taking split_queue_lock. I am thinking about
> using a folio_batch to store the out-of-split_queue folios, so that
> _deferred_list will not be touched while these folios are being split.
> Basically:
>
> 1. loop through split_queue and move folios to a folio_batch until the
>    folio_batch is full;
> 2. loop through the folio_batch to try to split each folio;
> 3. move the remaining folios back to split_queue.
>
> With this approach, split_queue_lock might be taken more often if there are
> more than 31 (the folio_batch max size) folios on split_queue, and
> split_queue_lock will be held longer in step 3, since the remaining folios
> need to be added back to split_queue one by one instead of in a single
> list splice.

IMHO, the folio_batch approach is worth trying (a rough sketch of what I
have in mind is at the bottom of this mail). The deferred list lock is
just held when deleting a folio from the deferred list and updating the
list length, so re-acquiring the lock every 31 folios doesn't seem too
bad. Of course, some benchmarking is needed.

The other subtle thing is that folio->_deferred_list is reused when the
folio is moved to the local on-stack list: checks on
list_empty(&folio->_deferred_list) still see the folio as queued even
though it is actually on the local on-stack list, not on split_queue.
Some code may depend on, or inadvertently depend on, this behavior.
Using a folio_batch may break some assumptions, but depending on such
subtle behavior is definitely not reliable IMHO.

>
> Let me know your thoughts. I can look into this if this approach sounds
> promising. Thanks.
>
>
> Best Regards,
> Yan, Zi
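
For reference, the rough shape I'd imagine for the folio_batch version
is below - an untested sketch against no particular tree, glossing over
the details of when a folio should be requeued, and using split_folio()
etc. purely illustratively:

	struct folio_batch fbatch;
	struct folio *folio, *next;
	unsigned long flags;
	int i, split = 0;

	folio_batch_init(&fbatch);

	/* 1. move up to 31 folios off split_queue, pinning each one */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
				 _deferred_list) {
		if (!folio_try_get(folio))
			continue;	/* being freed, the freer unqueues it */
		list_del_init(&folio->_deferred_list);
		ds_queue->split_queue_len--;
		if (!folio_batch_add(&fbatch, folio))
			break;		/* batch is full */
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* 2. try to split each folio, _deferred_list untouched meanwhile */
	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		folio = fbatch.folios[i];
		if (!folio_trylock(folio))
			continue;
		if (!split_folio(folio))
			split++;
		folio_unlock(folio);
	}

	/* 3. requeue the failures one by one, then drop our references */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		folio = fbatch.folios[i];
		/* still large and not requeued by anyone else: put it back */
		if (folio_test_large(folio) &&
		    list_empty(&folio->_deferred_list)) {
			list_add_tail(&folio->_deferred_list,
				      &ds_queue->split_queue);
			ds_queue->split_queue_len++;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	folio_batch_release(&fbatch);	/* drops the refs taken in step 1 */

Step 3 is the part that loses the single list splice; whether that shows
up in practice is what the benchmarking would have to tell us.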