From: Yang Shi <shy828301@gmail.com>
Date: Fri, 25 Oct 2024 11:36:15 -0700
Subject: Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
To: Zi Yan
Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
 "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand,
 Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang, Ryan Roberts,
 Nhat Pham, Chris Li, linux-kernel@vger.kernel.org, linux-mm@kvack.org
On Fri, Oct 25, 2024 at 8:32 AM Zi Yan wrote:
>
> On 25 Oct 2024, at 1:41, Hugh Dickins wrote:
>
> > On Thu, 24 Oct 2024, Zi Yan wrote:
> >> On 24 Oct 2024, at 0:10, Hugh Dickins wrote:
> >>
> >>> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> >>> I gave bad advice, it looks plausible since that's a local on-stack
> >>> list, but the fact is that it can race with a third party freeing or
> >>> migrating the preceding folio (properly unqueueing it with refcount 0
> >>> while holding split_queue_lock), thereby corrupting the list linkage.
> >>>
> >>> The obvious answer would be to take split_queue_lock there: but it has
> >>> a long history of contention, so I'm reluctant to add to that. Instead,
> >>> make sure that there is always one safe (raised refcount) folio before,
> >>> by delaying its folio_put(). (And of course I was wrong to suggest
> >>> updating split_queue_len without the lock: leave that until the splice.)
> >>
> >> I feel like this is not the right approach, since it breaks the existing
> >> condition for changing folio->_deferred_list, namely taking
> >> ds_queue->split_queue_lock for serialization. The contention might not be
> >> as high as you think, since if a folio were split, the split_queue_lock
> >> needed to be taken during the split anyway. So the worst case is the same
> >> as if all folios were split. Do you see significant perf degradation due to
> >> taking the lock when doing list_del_init()?
> >>
> >> I am afraid that if we take this route, we might hit hard-to-debug bugs
> >> in the future when someone touches the code.
> >
> > You have a good point: I am adding another element of trickiness
> > to that already-tricky local-but-not-quite list - which has tripped
> > us up a few times in the past.
> >
> > I do still feel that this solution is right in the spirit of that list;
> > but I've certainly not done any performance measurement to justify it,
> > nor would I ever trust my skill to do so. I just tried to solve the
> > corruptions in what I thought was the best way.
> >
> > (To be honest, I found this solution to the corruptions first, and thought
> > the bug went back to the original implementation: that its put_page() at
> > the end of the loop was premature all along. It was only when writing the
> > commit message two days ago that I came to realize that even put_page()
> > or folio_put() would be safely using the lock to unqueue: that it is only
> > this new list_del_init() which is the exception that introduces the bug.)
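
(Restating the race in code, just to make sure I read it right - a
condensed, hypothetical sketch rather than the actual tree. The
third-party unqueue at refcount 0 always takes the lock, while the new
removal from the local list does not:

	/* third party, e.g. folio_put() hitting refcount 0: unqueues the
	 * preceding folio under the lock, rewriting its neighbours' links */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {
		ds_queue->split_queue_len--;
		list_del_init(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* deferred_split_scan(): unlocked removal from the local list; if
	 * the locked removal above is rewriting the preceding folio's links
	 * at the same time, the linkage seen here is corrupted */
	list_del_init(&folio->_deferred_list);

so delaying the folio_put() keeps the preceding folio's refcount raised
and keeps the third party away from it.)
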
> >
> > Looking at vmstats, I'm coming to believe that the performance advantage
> > of this way is likely to be in the noise: that mTHPs alone, and the
> > !partially_mapped case on top, are greatly increasing the split_deferred
> > stats: and may give rise to renewed complaints of lock contention, with
> > or without this optimization.
> >
> > While I still prefer to stick with what's posted and most tested, I am
> > giving the locked version a run overnight. Thanks a lot for the reviews
> > and acks everyone: at present Zi Yan is in the minority preferring a
> > locked version, but please feel free to change your vote if you wish.
>
> Thank you a lot for taking the time to check the locked version. Looking
> forward to the result. BTW, I am not going to block this patch, since it
> fixes the bug.
>
> The tricky part in deferred_list_scan() is always the use of
> folio->_deferred_list without taking split_queue_lock. I am thinking about
> using a folio_batch to store the out-of-split_queue folios, so that
> _deferred_list will not be touched while these folios are being split.
> Basically:
>
> 1. loop through split_queue and move folios to a folio_batch until the
>    folio_batch is full;
> 2. loop through the folio_batch to try to split each folio;
> 3. move the remaining folios back to split_queue.
>
> With this approach, split_queue_lock might be taken more often if there are
> more than 31 (the folio_batch max size) folios on split_queue, and
> split_queue_lock will be held longer in step 3, since the remaining folios
> need to be added back to split_queue one by one instead of in a single
> list splice.

IMHO, the folio_batch approach is worth trying (a rough sketch of what I
have in mind is at the bottom of this mail). The deferred list lock is
just held when deleting a folio from the deferred list and updating the
list length, so re-acquiring the lock every 31 folios doesn't seem too
bad. Of course, some benchmarking is needed.

The other subtle thing is that folio->_deferred_list is reused when the
folio is moved to the local on-stack list: checks on
list_empty(&folio->_deferred_list) still see the folio as queued even
though it is actually on the local on-stack list, not on split_queue.
Some code may depend on, or inadvertently depend on, this behavior.
Using a folio_batch may break some assumptions, but depending on such
subtle behavior is definitely not reliable IMHO.

>
> Let me know your thoughts. I can look into this if this approach sounds
> promising. Thanks.
>
>
> Best Regards,
> Yan, Zi
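
For reference, the rough shape I'd imagine for the folio_batch version
is below - an untested sketch against no particular tree, glossing over
the details of when a folio should be requeued, and using split_folio()
etc. purely illustratively:

	struct folio_batch fbatch;
	struct folio *folio, *next;
	unsigned long flags;
	int i, split = 0;

	folio_batch_init(&fbatch);

	/* 1. move up to 31 folios off split_queue, pinning each one */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
				 _deferred_list) {
		if (!folio_try_get(folio))
			continue;	/* being freed, the freer unqueues it */
		list_del_init(&folio->_deferred_list);
		ds_queue->split_queue_len--;
		if (!folio_batch_add(&fbatch, folio))
			break;		/* batch is full */
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* 2. try to split each folio, _deferred_list untouched meanwhile */
	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		folio = fbatch.folios[i];
		if (!folio_trylock(folio))
			continue;
		if (!split_folio(folio))
			split++;
		folio_unlock(folio);
	}

	/* 3. requeue the failures one by one, then drop our references */
	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	for (i = 0; i < folio_batch_count(&fbatch); i++) {
		folio = fbatch.folios[i];
		/* still large and not requeued by anyone else: put it back */
		if (folio_test_large(folio) &&
		    list_empty(&folio->_deferred_list)) {
			list_add_tail(&folio->_deferred_list,
				      &ds_queue->split_queue);
			ds_queue->split_queue_len++;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	folio_batch_release(&fbatch);	/* drops the refs taken in step 1 */

Step 3 is the part that loses the single list splice; whether that shows
up in practice is what the benchmarking would have to tell us.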