From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 31 Dec 2024 00:54:55 +1300
Subject: Re: All MADV_FREE mTHPs are fully subjected to deferred_split_folio()
To: David Hildenbrand
Cc: Lance Yang, Linux-MM, Ryan Roberts, Baolin Wang, Andrew Morton
In-Reply-To: <142a47b6-ac31-465c-917e-7b2e98fddb2f@redhat.com>
References: <142a47b6-ac31-465c-917e-7b2e98fddb2f@redhat.com>
Content-Type: text/plain; charset="UTF-8"
On Mon, Dec 30, 2024 at 10:48 PM David Hildenbrand wrote:
>
> On 30.12.24 03:14, Lance Yang wrote:
> > Hi Barry,
> >
> > On Mon, Dec 30, 2024 at 5:13 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> Hi Lance,
> >>
> >> Along with Ryan, David, Baolin, and anyone else who might be interested,
> >>
> >> We've noticed an unexpectedly high number of deferred
> >> splits. The root cause appears to be the changes introduced in
> >> commit dce7d10be4bbd3 ("mm/madvise: optimize lazyfreeing with mTHP
> >> in madvise_free"). Since that commit, split_folio is no longer
> >> called in mm/madvise.c.
>
> Hi,
>
> I assume you don't see "deferred splits" at all. You see that a folio
> was added to the deferred split queue only to be removed again
> immediately as it gets freed. Correct?
>
> >>
> >> However, we still perform deferred_split_folio for all MADV_FREE
> >> mTHPs, even those that are fully aligned with mTHP. This happens
> >> because we execute a goto discard in try_to_unmap_one(), which
> >> eventually leads to folio_remove_rmap_pte() adding the folio to
> >> deferred_split when we scan the 1st pte in try_to_unmap_one().
> >>
> >> discard:
> >>         if (unlikely(folio_test_hugetlb(folio)))
> >>                 hugetlb_remove_rmap(folio);
> >>         else
> >>                 folio_remove_rmap_pte(folio, subpage, vma);
>
> Yes, that's kind-of known: we do PTE batching neither during unmap
> for reclaim nor during unmap for migration. We should add that
> support.
>
> But note, just as I raised earlier in the similar context of
> "improved partial-mapped logic in rmap code when batching", we are
> primarily only pleasing counters here.
>
> See below on the concurrent shrinker.
>
> >>
> >> This could lead to a race condition with the shrinker -
> >> deferred_split_scan(). The shrinker might call
> >> folio_try_get(folio), and while we are scanning the second PTE of
> >> this folio in try_to_unmap_one(), the entire mTHP could be
> >> transitioned back to swap-backed because the reference count is
> >> incremented.
> >>
> >>                         /*
> >>                          * The only page refs must be one from isolation
> >>                          * plus the rmap(s) (dropped by discard:).
> >>                          */
> >>                         if (ref_count == 1 + map_count &&
> >>                             (!folio_test_dirty(folio) ||
> >>                              ...
> >>                              (vma->vm_flags & VM_DROPPABLE))) {
> >>                                 dec_mm_counter(mm, MM_ANONPAGES);
> >>                                 goto discard;
> >>                         }
>
> Reclaim code holds an additional folio reference and has the folio
> locked. So I don't think this race can really happen in the way you
> think it could? Please feel free to correct me if I am wrong.

try_to_unmap_one() will only execute "goto discard" and remove the
rmap if ref_count == 1 + map_count. An additional ref_count + 1 from
the shrinker can invalidate this condition, leading to the restoration
of the PTE and setting the folio as swap-backed.

                        /*
                         * The only page refs must be one from isolation
                         * plus the rmap(s) (dropped by discard:).
                         */
                        if (ref_count == 1 + map_count &&
                            (!folio_test_dirty(folio) ||
                             /*
                              * Unlike MADV_FREE mappings, VM_DROPPABLE
                              * ones can be dropped even if they've
                              * been dirtied.
                              */
                             (vma->vm_flags & VM_DROPPABLE))) {
                                dec_mm_counter(mm, MM_ANONPAGES);
                                goto discard;
                        }

                        /*
                         * If the folio was redirtied, it cannot be
                         * discarded. Remap the page to page table.
                         */
                        set_pte_at(mm, address, pvmw.pte, pteval);

                        /*
                         * Unlike MADV_FREE mappings, VM_DROPPABLE ones
                         * never get swap backed on failure to drop.
                         */
                        if (!(vma->vm_flags & VM_DROPPABLE))
                                folio_set_swapbacked(folio);
                        goto walk_abort;

>
> >>
> >> It also significantly increases contention on
> >> ds_queue->split_queue_lock during memory reclamation and could
> >> potentially introduce other race conditions with the shrinker as
> >> well.
> >
> > Good catch!
>
> Call me "skeptical" that this is a big issue, at least regarding the
> shrinker, but also regarding actual lock contention. :)
>
> The issue might be less severe than you think: mostly pleasing
> counters. But yes, there is room for improvement.
>
> >>
> >> I'm curious if anyone has suggestions for resolving this issue. My
> >> idea is to use folio_remove_rmap_ptes to drop all PTEs at once,
> >> rather than folio_remove_rmap_pte, which processes PTEs one by one
> >> for an mTHP.
> >> This approach would require some changes, such as checking the
> >> dirty state of PTEs and performing a TLB flush for the entire mTHP
> >> as a whole in try_to_unmap_one().
> >
> > Yeah, IMHO, it would also be beneficial to reclaim entire mTHPs as
> > a whole in real-world scenarios where MADV_FREE mTHPs are typically
> > no longer written ;)
>
> We should be implementing folio batching. But it won't be able to
> cover all cases.
>
> In the future, I envision that during reclaim/access bit scanning, we
> determine whether a folio is partially mapped and add it to the
> deferred split queue. That's one requirement for getting rid of
> folio->_nr_pages_mapped and reliably detecting all partial mappings,
> but it also avoids having to mess with this information whenever we
> (temporarily) unmap only parts of a folio, like we have here.
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>