From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 872A8E77188 for ; Mon, 30 Dec 2024 16:03:02 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 68D7A6B007B; Mon, 30 Dec 2024 11:03:01 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 63C2B6B0083; Mon, 30 Dec 2024 11:03:01 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 504446B0085; Mon, 30 Dec 2024 11:03:01 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 31D826B007B for ; Mon, 30 Dec 2024 11:03:01 -0500 (EST) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id B54B045274 for ; Mon, 30 Dec 2024 16:03:00 +0000 (UTC) X-FDA: 82952093070.05.4B209A7 Received: from mail-ed1-f52.google.com (mail-ed1-f52.google.com [209.85.208.52]) by imf14.hostedemail.com (Postfix) with ESMTP id CFD6C10002D for ; Mon, 30 Dec 2024 16:02:06 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=RH7GrAn8; spf=pass (imf14.hostedemail.com: domain of ioworker0@gmail.com designates 209.85.208.52 as permitted sender) smtp.mailfrom=ioworker0@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1735574546; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=nZtZC46Pto87vZKgdetESMTwMUzkMCaM1sGofVJQJak=; b=R3VRnaPt9Fz3mv4Tuh/7HNArOS3FJAcpe/YHMSzYQJ3hJGpk66DDiIiEM3Yn8YECMfO/q+ Jf6j44pvcDRKRrzdmXJzst3d2wDp3CbvjtHWyffDQ6+PyNAD9m/oH4J9UwMzaZ0I2Jlrd1 1VDYXb28hB9UYASuGYsPMlfWGQ4gkro= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=RH7GrAn8; spf=pass (imf14.hostedemail.com: domain of ioworker0@gmail.com designates 209.85.208.52 as permitted sender) smtp.mailfrom=ioworker0@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1735574546; a=rsa-sha256; cv=none; b=SgG7h7EW4PQ9Vp2q6xW66bNGp+afn5Pv4hL8ThyYERUkZKphf5Dy3fSgAMzQSBseppHm7F c3TwsnFxzZQCYEv6ATzuhSomr2ULpts3YxkDwd86C2rZrvIiJesBTZt2bLiVelM9rQB+iX D+wQUJw8xvea6PIBsEy6cDpKGeFly4U= Received: by mail-ed1-f52.google.com with SMTP id 4fb4d7f45d1cf-5d0ac27b412so12309678a12.1 for ; Mon, 30 Dec 2024 08:02:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1735574577; x=1736179377; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=nZtZC46Pto87vZKgdetESMTwMUzkMCaM1sGofVJQJak=; b=RH7GrAn8ouG6FXjpVd4Pv1mMJMp22F0IJ6yOpZROm5NLg7yFsB+d6x5XJerHUqXf7u xpI5hh5tObCjIpnCdjhJ4Qbx5Oa1LtJ8R5j+KOhGOM2GdTLm8I3S4EJx+Tm6YxyR1YbN 5DOgsBXNswH1aL1GurFYO9eYFfioVyz2hQcMdJ1XmK05GnMcXrsCDemBBYd79X2agQlT QpBic67sSdWy0W4fdE9ONXGE+nsEY8MUOoqubDfst/pzbB5ELE1Ju3JydGYbml9upD5Q ev6n5eVdFMuH17PjiyBYcpI24oAZIyMKCdPkLITe/P78dIIk6vPkusibNyXy+LBFCgYX NEFQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1735574577; x=1736179377; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=nZtZC46Pto87vZKgdetESMTwMUzkMCaM1sGofVJQJak=; b=r5NrTtctaRxi0tePDihlg0DnEeb0huI41MbATlwGSUkXU/UA8+F9dMSRZsUB6/r7Ho 6s29YCo/uGhmeRTvcY1QAWoM/FoLDT2B/Vj6NNqtjN93zRr7Yu95FD04e1nm9prEUbA8 vF7tpN4bsYRGYC90vKCS1MmFOC7gtUmoGLZrXABB0Pn8LXdciog5LDXpt/+WFCsu8Dzb DJJUBiRJrb7WTKuru3YIUcIxVQBYJ4gpyffl6zaHT8p0NSDHA75M6vNMyjtQ6x3JTBWS 6nmh9Kqhgi6LIEcgUqXzBz1VAsOVmHFmM0CBGHyJLFFwZeRSbNvFSAuyJSfT8jln122f 6gLw== X-Forwarded-Encrypted: i=1; AJvYcCUnyL1pfp0Zs1gF4Zn06+WmUP57b+JXEFanx7khKBtmOpobEeFzFTDNkwDy3eAaRx6eL8ybpT4Yqw==@kvack.org X-Gm-Message-State: AOJu0YyuIINabOjRCxVkImtGLb8vzl8f585d2AqxsyBLyIxx2ocaKVhx Dl3cvvnY6R3hm/QGA0K2KgOhSgdUzUszE6mFhcbovjRobo1yuKAQHd5jT9pr5TsfvJY+lCW9P5V b/tiKOpyOA+rGVCHI74UD3ZdvY8U= X-Gm-Gg: ASbGnctH1QosMR2oXxpzj8ZBFxEm3IcnlhWWwISm7rFbBFsBWEQ/EYJ4QKRGMH4u1hh Z/Qhag0/ZOK2UNUhWBYiOPzbmHrMXhIULEFTy1EFK X-Google-Smtp-Source: AGHT+IG7TDxt8tqw7/zujg09POaYOkvEsMsrhqXa8wPsxtYKCFrfzx6hSEQZ2T2iSZU2iQwb+jlO7A/CdFvoR+11W54= X-Received: by 2002:a05:6402:524d:b0:5d2:723c:a57e with SMTP id 4fb4d7f45d1cf-5d81ddacfeemr33714982a12.16.1735574576740; Mon, 30 Dec 2024 08:02:56 -0800 (PST) MIME-Version: 1.0 References: <142a47b6-ac31-465c-917e-7b2e98fddb2f@redhat.com> <8690de27-a1be-4440-a2d6-1a5cc56dcceb@redhat.com> In-Reply-To: <8690de27-a1be-4440-a2d6-1a5cc56dcceb@redhat.com> From: Lance Yang Date: Tue, 31 Dec 2024 00:02:20 +0800 Message-ID: Subject: Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio() To: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com>, Linux-MM , Ryan Roberts , Baolin Wang , Andrew Morton Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam05 X-Stat-Signature: ohhgr9njmuu1uhtznf7p3r5hfmps9j4j X-Rspamd-Queue-Id: CFD6C10002D X-Rspam-User: X-HE-Tag: 1735574526-321760 X-HE-Meta: U2FsdGVkX1/H0InX1NAezpEnvZbnc4qMJ8hpB+eyXJ4FPDksQFVjxvqB001jp3wplbyTnQXamzzCJLGF+GuDgeXd5WFEWsLGO2kp36QsoZhFA4uoW2aFZ4T4TZMrEnbg2AwjuFYVjYDt+7kkJj3ulOu3DJq7aBTvcQruimNfZWE9o8TnMzDws6eDUIEaz3PJVjeTbGerCMmjp4jLS0P/mzraqoamBn5bkosyMt+9BPgbHNfGk8XHScJJdTCRk0eFnOURMyZzygmyF0JdCB3JqgVABQOxY7YTAivMr9NrYWvrUKfsl89YGf6yImSPZC2NHDC8i8XBX9ec4SOo1+oe3amPgGXL+eKfUHyxDnlnRX1har1KOlQ0qbP3JtrOoKxAfl2hVXAwtzpfkAbZnQlJxSRyoHm46D0LenJYxqWrPHsWyuhd0WBO+KMTQ2TSI0d7tlUKu0SZ5WW3pt6dSMVnjFTKI3b3jXAclMxvp92Q1tHQfbfgwXOqTDbwGB6zb/kI2eQXBVxlQjuZTkZSiqplSgDd44VgM8SXSz0B6LDuH3qJ9+E1xFR2HMGULdIx8A2llI6NeukNly6NqkmVcxXOVphnNWSkPt1u+lLsu6YWRthp4J8mH9WiFxchkGqUlmeUQtsBpXr3iJUMkKYjOeHZ/KZmI2imDfCdbA8jCaPjmLeN7y2k63svLQS3EIwIZRD/vYh4Vlg2Ao2PUU0IAcQtZ/qK2zNbaVnRotDm4hBCRBOKyZP4CoRwizr4QsPaDRTgS1K9fAjSp6miGjZPBMmuYxFFRXpV/PMUaRklhVbT7WeBkGZ8s0KIw/oRC6giGq5v1AeHe1k4tNO4mKzHVKn/iuH/kjN8V/KvHXRar0FPhQVjE7rTJdVfJf0Bwfqnm3Q3mwMGVsFQBBO9dL+EjRbrWB3tdvVRE2XpdWsLv33Bgj9dNKIGMS+t/q93F9Ecdj0QyxboBDTD2Ke5gnK0ex2 RSySwg02 F3TYJWbXpPb/amWpuzkDzBqj8nEA1dNQK4r+zCOd5FknSQqwOBiBBGuXD5JyqwXPHbs9bGJc0wgmDaBbnb1hyBO6Z9JzcMpY55Hn4J09agbm86Z6I8RvPwXmJkGKivPw4Fvtct8VKxb71fO2u98Bdy53hTsALV8ozzQZDTjK/mpo7zVfxvQBtzq4BbfUh8dRnpcnMj2HMcoKxaiP/gQQ1cW+6Ck+lPp28PoG5StT+Adf24kDboaWFnJM9ZPzcPuFNrp3RMO/3ALtg8Ji/1pU+SA1LwNxzBRkCiJ7gVwIxCviJ3Qa87vx+HTp2XB4f1svvV4EnlYZRpxlyT+4jLKDFpZMU9pqfMCNFvBoPPAgvpjLDQVBGoXBviMOniDRJN0kVHJLhCD+CsthrZBvr8ZS2iAbXmvFjD8GjPg4C X-Bogosity: Ham, tests=bogofilter, spamicity=0.000358, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Dec 30, 2024 at 8:52=E2=80=AFPM David Hildenbrand wrote: > > On 30.12.24 12:54, Barry Song wrote: > > On Mon, Dec 30, 2024 at 10:48=E2=80=AFPM David Hildenbrand wrote: > >> > >> On 30.12.24 03:14, Lance Yang wrote: > >>> Hi Barry, > >>> > >>> On Mon, Dec 30, 2024 at 5:13=E2=80=AFAM Barry Song <21cnbao@gmail.com= > wrote: > >>>> > >>>> Hi Lance, > >>>> > >>>> Along with Ryan, David, Baolin, and anyone else who might be interes= ted, > >>>> > >>>> We=E2=80=99ve noticed an unexpectedly high number of deferred splits= . The root > >>>> cause appears to be the changes introduced in commit dce7d10be4bbd3 > >>>> ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free"). Sinc= e > >>>> that commit, split_folio is no longer called in mm/madvise.c. > >> > >> Hi, > >> > >> I assume you don't see "deferred splits" at all. You see that a folio > >> was added to the deferred split queue to immediately be removed again = as > >> it gets freed. Correct? > >> > >>>> > >>>> However, we are still performing deferred_split_folio for all > >>>> MADV_FREE mTHPs, even for those that are fully aligned with mTHP. > >>>> This happens because we execute a goto discard in > >>>> try_to_unmap_one(), which eventually leads to > >>>> folio_remove_rmap_pte() adding all folios to deferred_split when we > >>>> scan the 1st pte in try_to_unmap_one(). > >>>> > >>>> discard: > >>>> if (unlikely(folio_test_hugetlb(folio))) > >>>> hugetlb_remove_rmap(folio); > >>>> else > >>>> folio_remove_rmap_pte(folio, subpage, vma)= ; > >> > >> Yes, that's kind-of know: we neither do PTE batching during unmap for > >> reclaim nor during unmap for migration. We should add that support. > >> > >> But note, just like I raised earlier in the context of similar to > >> "improved partial-mapped logic in rmap code when batching", we are > >> primarily only pleasing counters here. > >> > >> See below on concurrent shrinker. > >> > >>>> > >>>> This could lead to a race condition with shrinker - deferred_split_s= can(). > >>>> The shrinker might call folio_try_get(folio), and while we are scann= ing > >>>> the second PTE of this folio in try_to_unmap_one(), the entire mTHP > >>>> could be transitioned back to swap-backed because the reference coun= t > >>>> is incremented. > >>>> > >>>> /* > >>>> * The only page refs must be one = from isolation > >>>> * plus the rmap(s) (dropped by di= scard:). > >>>> */ > >>>> if (ref_count =3D=3D 1 + map_count= && > >>>> (!folio_test_dirty(folio) || > >>>> ... > >>>> (vma->vm_flags & VM_DROPPABLE= ))) { > >>>> dec_mm_counter(mm, MM_ANON= PAGES); > >>>> goto discard; > >>>> } > >> > >> > >> Reclaim code holds an additional folio reference and has the folio > >> locked. So I don't think this race can really happen in the way you > >> think it could? Please feel free to correct me if I am wrong. > > > > try_to_unmap_one will only execute "goto discard" and remove the rmap i= f > > ref_count =3D=3D 1 + map_count. An additional ref_count + 1 from the sh= rinker > > can invalidate this condition, leading to the restoration of the PTE an= d setting > > the folio as swap-backed. > > > > /* > > * The only page refs must be one from= isolation > > * plus the rmap(s) (dropped by discar= d:). > > */ > > if (ref_count =3D=3D 1 + map_count && > > (!folio_test_dirty(folio) || > > /* > > * Unlike MADV_FREE mappings, VM_= DROPPABLE > > * ones can be dropped even if th= ey've > > * been dirtied. > > */ > > (vma->vm_flags & VM_DROPPABLE))) = { > > dec_mm_counter(mm, MM_ANONPAGE= S); > > goto discard; > > } > > > > /* > > * If the folio was redirtied, it cann= ot be > > * discarded. Remap the page to page t= able. > > */ > > set_pte_at(mm, address, pvmw.pte, ptev= al); > > /* > > * Unlike MADV_FREE mappings, VM_DROPP= ABLE ones > > * never get swap backed on failure to= drop. > > */ > > if (!(vma->vm_flags & VM_DROPPABLE)) > > folio_set_swapbacked(folio); > > goto walk_abort; > > Ah, that's what you mean. Yes, but the shrinker behaves mostly like just > any other speculative reference. > > So we're not actually handling speculative references here correctly, so > this issue is not completely shrinker-specific. > > Maybe, we should be doing something like this? > > /* > * Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if > * they've been dirtied. > */ > if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) { > /* > * redirtied either using the page table or a previously > * obtained GUP reference. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > folio_set_swapbacked(folio); > goto walk_abort; > } else if (ref_count !=3D 1 + map_count) { > /* > * Additional reference. Could be a GUP reference or any > * speculative reference. GUP users must mark the folio dirty if > * there was a modification. This folio cannot be reclaimed > * right now either way, so act just like nothing happened. > * We'll come back here later and detect if the folio was > * dirtied when the additional reference is gone. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > goto walk_abort; > } > goto discard; > > > Probably cleaning up goto labels. Nice! It looks simple and easy to understand ;) It might be reasonable to temporarily ignore the folio and check later whether it was marked dirty after the reference is released, as the additional reference could be a GUP or another speculative reference. Personally, I also prefer this approach. Thanks, Lance > > > -- > Cheers, > > David / dhildenb >