From: Barry Song <21cnbao@gmail.com>
Date: Fri, 8 Mar 2024 02:54:49 +0800
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Ryan Roberts
Cc: David Hildenbrand, Lance Yang, Vishal Moola, akpm@linux-foundation.org, zokeefe@google.com, shy828301@gmail.com, mhocko@suse.com, fengwei.yin@intel.com, xiehuan09@gmail.com, wangkefeng.wang@huawei.com, songmuchun@bytedance.com, peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240307061425.21013-1-ioworker0@gmail.com> <03458c20-5544-411b-9b8d-b4600a9b802f@arm.com> <501c9f77-1459-467a-8619-78e86b46d300@arm.com> <8f84c7d6-982a-4933-a7a7-3f640df64991@redhat.com>
On Fri, Mar 8, 2024 at 12:31 AM Ryan Roberts wrote:
>
> On 07/03/2024 12:01, Barry Song wrote:
> > On Thu, Mar 7, 2024 at 7:45 PM David Hildenbrand wrote:
> >>
> >> On 07.03.24 12:42, Ryan Roberts wrote:
> >>> On 07/03/2024 11:31, David Hildenbrand wrote:
> >>>> On 07.03.24 12:26, Barry Song wrote:
> >>>>> On Thu, Mar 7, 2024 at 7:13 PM Ryan Roberts wrote:
> >>>>>>
> >>>>>> On 07/03/2024 10:54, David Hildenbrand wrote:
> >>>>>>> On 07.03.24 11:54, David Hildenbrand wrote:
> >>>>>>>> On 07.03.24 11:50, Ryan Roberts wrote:
> >>>>>>>>> On 07/03/2024 09:33, Barry Song wrote:
> >>>>>>>>>> On Thu, Mar 7, 2024 at 10:07 PM Ryan Roberts wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 07/03/2024 08:10, Barry Song wrote:
> >>>>>>>>>>>> On Thu, Mar 7, 2024 at 9:00 PM Lance Yang wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hey Barry,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for taking time to review!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 7, 2024 at 3:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 7, 2024 at 7:15 PM Lance Yang wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> >>>>>>>>>>>>>>> +                                                 struct folio *folio,
> >>>>>>>>>>>>>>> +                                                 pte_t *start_pte)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> +        int nr_pages = folio_nr_pages(folio);
> >>>>>>>>>>>>>>> +        fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +        for (int i = 0; i < nr_pages; i++)
> >>>>>>>>>>>>>>> +                if (page_mapcount(folio_page(folio, i)) != 1)
> >>>>>>>>>>>>>>> +                        return false;
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> we have moved to folio_estimated_sharers: though it is not precise,
> >>>>>>>>>>>>>> we don't do this check with lots of loops that depend on each
> >>>>>>>>>>>>>> subpage's mapcount.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we don't check the subpage's mapcount, and there is a cow folio
> >>>>>>>>>>>>> associated with this folio and the cow folio has a smaller size than
> >>>>>>>>>>>>> this folio, should we still mark this folio as lazyfree?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree, this is true. However, we've somehow accepted the fact that
> >>>>>>>>>>>> folio_likely_mapped_shared can result in false negatives or false
> >>>>>>>>>>>> positives to balance the overhead. So I really don't know :-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maybe David and Vishal can give some comments here.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> BTW, do we need to rebase our work against David's changes [1]?
> >>>>>>>>>>>>>> [1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@redhat.com/
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, we should rebase our work against David's changes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +        return nr_pages == folio_pte_batch(folio, addr, start_pte,
> >>>>>>>>>>>>>>> +                                           ptep_get(start_pte), nr_pages,
> >>>>>>>>>>>>>>> +                                           flags, NULL);
> >>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>>>>                                    unsigned long end, struct mm_walk *walk)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>>>>                   */
> >>>>>>>>>>>>>>>                  if (folio_test_large(folio)) {
> >>>>>>>>>>>>>>>                          int err;
> >>>>>>>>>>>>>>> +                        unsigned long next_addr, align;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -                        if (folio_estimated_sharers(folio) != 1)
> >>>>>>>>>>>>>>> -                                break;
> >>>>>>>>>>>>>>> -                        if (!folio_trylock(folio))
> >>>>>>>>>>>>>>> -                                break;
> >>>>>>>>>>>>>>> +                        if (folio_estimated_sharers(folio) != 1 ||
> >>>>>>>>>>>>>>> +                            !folio_trylock(folio))
> >>>>>>>>>>>>>>> +                                goto skip_large_folio;
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of them
> >>>>>>>>>>>>>> might be pointing to other folios.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For example, for a large folio with 16 PTEs, you do MADV_DONTNEED(15-16)
> >>>>>>>>>>>>>> and then write the memory of PTE15 and PTE16; you get page faults, thus
> >>>>>>>>>>>>>> PTE15 and PTE16 will point to two different small folios. We can only
> >>>>>>>>>>>>>> skip when we are sure that nr_pages == folio_pte_batch().
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Agreed. Thanks for pointing that out.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +                        align = folio_nr_pages(folio) * PAGE_SIZE;
> >>>>>>>>>>>>>>> +                        next_addr = ALIGN_DOWN(addr + align, align);
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +                        /*
> >>>>>>>>>>>>>>> +                         * If we mark only the subpages as lazyfree, or
> >>>>>>>>>>>>>>> +                         * cannot mark the entire large folio as
> >>>>>>>>>>>>>>> +                         * lazyfree, then just split it.
> >>>>>>>>>>>>>>> +                         */
> >>>>>>>>>>>>>>> +                        if (next_addr > end || next_addr - addr != align ||
> >>>>>>>>>>>>>>> +                            !can_mark_large_folio_lazyfree(addr, folio, pte))
> >>>>>>>>>>>>>>> +                                goto split_large_folio;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> +                        /*
> >>>>>>>>>>>>>>> +                         * Avoid unnecessary folio splitting if the large
> >>>>>>>>>>>>>>> +                         * folio is entirely within the given range.
> >>>>>>>>>>>>>>> +                         */
> >>>>>>>>>>>>>>> +                        folio_clear_dirty(folio);
> >>>>>>>>>>>>>>> +                        folio_unlock(folio);
> >>>>>>>>>>>>>>> +                        for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> >>>>>>>>>>>>>>> +                                ptent = ptep_get(pte);
> >>>>>>>>>>>>>>> +                                if (pte_young(ptent) || pte_dirty(ptent)) {
> >>>>>>>>>>>>>>> +                                        ptent = ptep_get_and_clear_full(
> >>>>>>>>>>>>>>> +                                                mm, addr, pte, tlb->fullmm);
> >>>>>>>>>>>>>>> +                                        ptent = pte_mkold(ptent);
> >>>>>>>>>>>>>>> +                                        ptent = pte_mkclean(ptent);
> >>>>>>>>>>>>>>> +                                        set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>>>>>>>> +                                        tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>>>>>>>> +                                }
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Can we do this in batches? For a CONT-PTE mapped large folio, you are
> >>>>>>>>>>>>>> unfolding and folding again. It seems quite expensive.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not convinced we should be doing this in batches. We want the initial
> >>>>>>>>>>> folio_pte_batch() to be as loose as possible regarding permissions so that we
> >>>>>>>>>>> reduce our chances of splitting folios to the min (e.g. ignore SW bits like
> >>>>>>>>>>> soft dirty, etc). I think it might be possible that some PTEs are RO and others
> >>>>>>>>>>> RW too (e.g. due to cow - although with the current cow impl, probably not. But
> >>>>>>>>>>> it's fragile to assume that). Anyway, if we do an initial batch that ignores all
> >>>>>>>>>>
> >>>>>>>>>> You are correct. I believe this scenario could indeed occur. For instance,
> >>>>>>>>>> if process A forks process B and then unmaps itself, leaving B as the
> >>>>>>>>>> sole process owning the large folio. The current wp_page_reuse() function
> >>>>>>>>>> will reuse the PTEs one by one as each specific subpage is written.
> >>>>>>>>>
> >>>>>>>>> Hmm - I thought it would only reuse if the total mapcount for the folio was 1.
> >>>>>>>>> And since it is a large folio with each page mapped once in proc B, I thought
> >>>>>>>>> every subpage write would cause a copy except the last one? I haven't looked at
> >>>>>>>>> the code for a while. But I had it in my head that this is an area we need to
> >>>>>>>>> improve for mTHP.
> >>>>>
> >>>>> So sad I am wrong again 😢
> >>>>>
> >>>>>>>>
> >>>>>>>> wp_page_reuse() will currently reuse a PTE part of a large folio only if
> >>>>>>>> a single PTE remains mapped (refcount == 0).
> >>>>>>>
> >>>>>>> ^ == 1
> >>>>>
> >>>>> Seems this needs improvement; it is a waste, as the last subpage could
> >>>>
> >>>> My take on that is WIP:
> >>>>
> >>>> https://lore.kernel.org/all/20231124132626.235350-1-david@redhat.com/T/#u
> >>>>
> >>>>> reuse the whole large folio. I was doing it in a quite different way:
> >>>>> if the large folio had only one subpage left, I would do the copy and
> >>>>> release the large folio[1]; and if I could reuse the whole large folio
> >>>>> with CONT-PTE, I would reuse the whole large folio[2]. In mainline,
> >>>>> we don't have this cont-pte luxury exposed to mm, so I guess we cannot
> >>>>> do [2] easily, but [1] seems to be an optimization.
> >>>>
> >>>> Yeah, I had essentially the same idea: just free up the large folio if most of
> >>>> the stuff is unmapped. But that's rather a corner-case optimization, so I did
> >>>> not proceed with that.
> >>>>
> >>>
> >>> I'm not sure it's a corner case, really? - process forks, then both parent and
> >>> child write to all pages in what was previously a fully & contiguously
> >>> mapped large folio?
> >>
> >> Well, with 2 MiB my assumption was that while it can happen, it's rather
> >> rare. With smaller THP it might get more likely, agreed.
> >>
> >>>
> >>> Regardless, why is it an optimization to do the copy for the last subpage and
> >>> synchronously free the large folio? It's already partially mapped so is on the
> >>> deferred split list and can be split if memory is tight.
> >
> > we don't want reclamation overhead later, and we want memory immediately
> > available to others.
>
> But by that logic, you also don't want to leave the large folio partially mapped
> all the way until the last subpage is CoWed. Surely you would want to reclaim it
> when you reach partial map status?

To some extent, I agree. But then we will have too many copies. The last
subpage is small, and a safe place to copy instead. We actually had to tune
userspace to decrease partial mapping, as too much of it both unfolded
CONT-PTE and wasted too much memory. If a VMA had too much partial mapping,
we disabled mTHP on that VMA.

> > reclamation will always cause latency and affect user
> > experience. split_folio is not cheap :-)
>
> But neither is memcpy(4K) I'd imagine. But I get your point.

In a real product scenario, we need to consider the success rate of
allocating large folios. Currently, it's only 7%, as reported here[1],
with no method to keep large folios intact in a buddy system.

Yu's TAO[2] chose to release the large folio entirely after copying
the mapped parts onto smaller folios in vmscan.

[1] https://lore.kernel.org/linux-mm/20240305083743.24950-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/

> > if the number of this kind of large folio
> > is huge, the waste can be huge for a while.
> >
> > it is not a corner case for large folio swap-in. While someone writes
> > one subpage, I swap in a large folio, and wp_reuse will immediately
> > be called. This can cause waste quite often. One outcome of this
> > discussion is that I realize I should investigate this issue immediately
> > in the swap-in series, as my off-tree code has optimized reuse but
> > mainline hasn't.
> >
> >>
> >> At least for 2 MiB THP, it might make sense to make that large folio
> >> available immediately again, even without memory pressure. Even
> >> compaction would not compact it.
> >
> > It is also true for 64 KiB, as we want other processes to allocate
> > 64 KiB successfully as much as possible, and to reduce the rate of
> > falling back to small folios. By releasing 64 KiB directly to buddy
> > rather than splitting and returning 15*4 KiB in the shrinker, we reduce
> > buddy fragmentation too.
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb

Thanks
Barry