From: Barry Song <21cnbao@gmail.com>
Date: Sat, 9 Mar 2024 02:01:22 +0800
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Ryan Roberts
Cc: David Hildenbrand, Lance Yang, Vishal Moola, akpm@linux-foundation.org, zokeefe@google.com, shy828301@gmail.com, mhocko@suse.com, fengwei.yin@intel.com, xiehuan09@gmail.com, wangkefeng.wang@huawei.com, songmuchun@bytedance.com, peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Fri, Mar 8, 2024 at 9:05 PM Ryan Roberts wrote:
>
> On 07/03/2024 18:54, Barry Song wrote:
> > On Fri, Mar 8, 2024 at 12:31 AM Ryan Roberts wrote:
> >>
> >> On 07/03/2024 12:01, Barry Song wrote:
> >>> On Thu, Mar 7, 2024 at 7:45 PM David Hildenbrand wrote:
> >>>>
> >>>> On 07.03.24 12:42, Ryan Roberts wrote:
> >>>>> On 07/03/2024 11:31, David Hildenbrand wrote:
> >>>>>> On 07.03.24 12:26, Barry Song wrote:
> >>>>>>> On Thu, Mar 7, 2024 at 7:13 PM Ryan Roberts wrote:
> >>>>>>>>
> >>>>>>>> On 07/03/2024 10:54, David Hildenbrand wrote:
> >>>>>>>>> On 07.03.24 11:54, David Hildenbrand wrote:
> >>>>>>>>>> On 07.03.24 11:50, Ryan Roberts wrote:
> >>>>>>>>>>> On 07/03/2024 09:33, Barry Song wrote:
> >>>>>>>>>>>> On Thu, Mar 7, 2024 at 10:07 PM Ryan Roberts wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 07/03/2024 08:10, Barry Song wrote:
> >>>>>>>>>>>>>> On Thu, Mar 7, 2024 at 9:00 PM Lance Yang wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hey Barry,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for taking time to review!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Mar 7, 2024 at 3:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 7, 2024 at 7:15 PM Lance Yang wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [...]
> >>>>>>>>>>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> >>>>>>>>>>>>>>>>> +                                                 struct folio *folio,
> >>>>>>>>>>>>>>>>> +                                                 pte_t *start_pte)
> >>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>>>>>>>>>>>> +       fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +       for (int i = 0; i < nr_pages; i++)
> >>>>>>>>>>>>>>>>> +               if (page_mapcount(folio_page(folio, i)) != 1)
> >>>>>>>>>>>>>>>>> +                       return false;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> we have moved to folio_estimated_sharers though it is not precise, so
> >>>>>>>>>>>>>>>> we don't do this check with lots of loops depending on the subpage's
> >>>>>>>>>>>>>>>> mapcount.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we don't check the subpage's mapcount, and there is a cow folio
> >>>>>>>>>>>>>>> associated with this folio and the cow folio has a smaller size than
> >>>>>>>>>>>>>>> this folio, should we still mark this folio as lazyfree?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree, this is true. However, we've somehow accepted the fact that
> >>>>>>>>>>>>>> folio_likely_mapped_shared can result in false negatives or false
> >>>>>>>>>>>>>> positives to balance the overhead. So I really don't know :-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maybe David and Vishal can give some comments here.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> BTW, do we need to rebase our work against David's changes[1]?
> >>>>>>>>>>>>>>>> [1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@redhat.com/
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yes, we should rebase our work against David's changes.
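(To make the trade-off above concrete: as far as I remember, the heuristic
only samples the first subpage, along the lines of the sketch below - not
necessarily the literal mainline definition - which is exactly why it can
be wrong for any other subpage.)

static inline int folio_estimated_sharers(struct folio *folio)
{
        /* only subpage 0 is sampled; another subpage may still be
         * shared (false negative) or exclusive (false positive) */
        return page_mapcount(folio_page(folio, 0));
}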
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +       return nr_pages == folio_pte_batch(folio, addr, start_pte,
> >>>>>>>>>>>>>>>>> +                                          ptep_get(start_pte), nr_pages,
> >>>>>>>>>>>>>>>>> +                                          flags, NULL);
> >>>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>>>>>>                                    unsigned long end, struct mm_walk *walk)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>>>>>>                  */
> >>>>>>>>>>>>>>>>>                 if (folio_test_large(folio)) {
> >>>>>>>>>>>>>>>>>                         int err;
> >>>>>>>>>>>>>>>>> +                       unsigned long next_addr, align;
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -                       if (folio_estimated_sharers(folio) != 1)
> >>>>>>>>>>>>>>>>> -                               break;
> >>>>>>>>>>>>>>>>> -                       if (!folio_trylock(folio))
> >>>>>>>>>>>>>>>>> -                               break;
> >>>>>>>>>>>>>>>>> +                       if (folio_estimated_sharers(folio) != 1 ||
> >>>>>>>>>>>>>>>>> +                           !folio_trylock(folio))
> >>>>>>>>>>>>>>>>> +                               goto skip_large_folio;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of them
> >>>>>>>>>>>>>>>> might be pointing to other folios.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> for example, for a large folio with 16 PTEs, you do MADV_DONTNEED(15-16),
> >>>>>>>>>>>>>>>> and write the memory of PTE15 and PTE16, you get page faults, thus PTE15
> >>>>>>>>>>>>>>>> and PTE16 will point to two different small folios. We can only skip
> >>>>>>>>>>>>>>>> when we are sure nr_pages == folio_pte_batch().
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Agreed. Thanks for pointing that out.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> >>>>>>>>>>>>>>>>> +                       next_addr = ALIGN_DOWN(addr + align, align);
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +                       /*
> >>>>>>>>>>>>>>>>> +                        * If we mark only the subpages as lazyfree, or
> >>>>>>>>>>>>>>>>> +                        * cannot mark the entire large folio as lazyfree,
> >>>>>>>>>>>>>>>>> +                        * then just split it.
> >>>>>>>>>>>>>>>>> +                        */
> >>>>>>>>>>>>>>>>> +                       if (next_addr > end || next_addr - addr != align ||
> >>>>>>>>>>>>>>>>> +                           !can_mark_large_folio_lazyfree(addr, folio, pte))
> >>>>>>>>>>>>>>>>> +                               goto split_large_folio;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +                       /*
> >>>>>>>>>>>>>>>>> +                        * Avoid unnecessary folio splitting if the large
> >>>>>>>>>>>>>>>>> +                        * folio is entirely within the given range.
> >>>>>>>>>>>>>>>>> +                        */
> >>>>>>>>>>>>>>>>> +                       folio_clear_dirty(folio);
> >>>>>>>>>>>>>>>>> +                       folio_unlock(folio);
> >>>>>>>>>>>>>>>>> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> >>>>>>>>>>>>>>>>> +                               ptent = ptep_get(pte);
> >>>>>>>>>>>>>>>>> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> >>>>>>>>>>>>>>>>> +                                       ptent = ptep_get_and_clear_full(
> >>>>>>>>>>>>>>>>> +                                               mm, addr, pte, tlb->fullmm);
> >>>>>>>>>>>>>>>>> +                                       ptent = pte_mkold(ptent);
> >>>>>>>>>>>>>>>>> +                                       ptent = pte_mkclean(ptent);
> >>>>>>>>>>>>>>>>> +                                       set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>>>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>>>>>>>>>> +                               }
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Can we do this in batches? for a CONT-PTE mapped large folio, you are
> >>>>>>>>>>>>>>>> unfolding and folding again. It seems quite expensive.
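To sketch what I mean by batching (a purely hypothetical helper - nothing
like it exists in mainline today): a single call covering the whole range
would let an arch such as arm64 unfold and refold a CONT-PTE block once,
with the generic fallback still being a loop:

static inline void mkold_clean_ptes(struct mm_struct *mm, unsigned long addr,
                                    pte_t *ptep, unsigned int nr, int full)
{
        unsigned int i;

        /* generic fallback: still per-PTE, but the single call site lets
         * an arch (e.g. arm64 with CONT-PTE) override this to unfold and
         * refold the contiguous mapping once for the whole range */
        for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
                pte_t ptent = ptep_get_and_clear_full(mm, addr, ptep, full);

                ptent = pte_mkclean(pte_mkold(ptent));
                set_pte_at(mm, addr, ptep, ptent);
        }
}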
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not convinced we should be doing this in batches. We want the initial
> >>>>>>>>>>>>> folio_pte_batch() to be as loose as possible regarding permissions so that we
> >>>>>>>>>>>>> reduce our chances of splitting folios to the min. (e.g. ignore SW bits like
> >>>>>>>>>>>>> soft dirty, etc). I think it might be possible that some PTEs are RO and other
> >>>>>>>>>>>>> RW too (e.g. due to cow - although with the current cow impl, probably not. But
> >>>>>>>>>>>>> it's fragile to assume that). Anyway, if we do an initial batch that ignores all
> >>>>>>>>>>>>
> >>>>>>>>>>>> You are correct. I believe this scenario could indeed occur. For instance,
> >>>>>>>>>>>> if process A forks process B and then unmaps itself, leaving B as the
> >>>>>>>>>>>> sole process owning the large folio. The current wp_page_reuse() function
> >>>>>>>>>>>> will reuse PTEs one by one while the specific subpage is written.
> >>>>>>>>>>>
> >>>>>>>>>>> Hmm - I thought it would only reuse if the total mapcount for the folio was 1.
> >>>>>>>>>>> And since it is a large folio with each page mapped once in proc B, I thought
> >>>>>>>>>>> every subpage write would cause a copy except the last one? I haven't looked at
> >>>>>>>>>>> the code for a while. But I had it in my head that this is an area we need to
> >>>>>>>>>>> improve for mTHP.
> >>>>>>>
> >>>>>>> So sad I am wrong again 😢
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> wp_page_reuse() will currently reuse a PTE part of a large folio only if
> >>>>>>>>>> a single PTE remains mapped (refcount == 0).
> >>>>>>>>>
> >>>>>>>>> ^ == 1
> >>>>>>>
> >>>>>>> seems this needs improvement. it is a waste the last subpage can
> >>>>>>
> >>>>>> My take on that is WIP:
> >>>>>>
> >>>>>> https://lore.kernel.org/all/20231124132626.235350-1-david@redhat.com/T/#u
> >>>>>>
> >>>>>>> reuse the whole large folio. I was doing it in a quite different way:
> >>>>>>> if the large folio had only one subpage left, I would do the copy and
> >>>>>>> release the large folio[1]. and if I could reuse the whole large folio
> >>>>>>> with CONT-PTE, I would reuse the whole large folio[2]. in mainline,
> >>>>>>> we don't have this cont-pte luxury exposed to mm, so I guess we
> >>>>>>> cannot do [2] easily, but [1] seems to be an optimization.
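Just to make the reuse condition concrete, my understanding is roughly the
sketch below (greatly simplified - swapcache, pinning and KSM handling all
omitted; not the literal do_wp_page() code):

static bool can_reuse_subpage_of_large_folio(struct folio *folio)
{
        if (!folio_test_anon(folio))
                return false;

        /* only the last remaining reference may take the reuse path;
         * every earlier write fault on another subpage has to copy */
        return folio_ref_count(folio) == 1;
}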
> >>>>>>
> >>>>>> Yeah, I had essentially the same idea: just free up the large folio if most of
> >>>>>> the stuff is unmapped. But that's rather a corner-case optimization, so I did
> >>>>>> not proceed with that.
> >>>>>>
> >>>>>
> >>>>> I'm not sure it's a corner case, really? - process forks, then both parent and
> >>>>> child write to all pages in what was previously a fully & contiguously
> >>>>> mapped large folio?
> >>>>
> >>>> Well, with 2 MiB my assumption was that while it can happen, it's rather
> >>>> rare. With smaller THP it might get more likely, agreed.
> >>>>
> >>>>>
> >>>>> Regardless, why is it an optimization to do the copy for the last subpage and
> >>>>> synchronously free the large folio? It's already partially mapped so is on the
> >>>>> deferred split list and can be split if memory is tight.
> >>>
> >>> we don't want reclamation overhead later, and we want memory immediately
> >>> available to others.
> >>
> >> But by that logic, you also don't want to leave the large folio partially mapped
> >> all the way until the last subpage is CoWed. Surely you would want to reclaim it
> >> when you reach partial map status?
> >
> > To some extent, I agree. But then we will have too many copies. The last
> > subpage is small, and a safe one to copy instead.
> >
> > We actually had to tune userspace to decrease partial map, as too much
> > partial map both unfolded CONT-PTE and wasted too much memory. If a
> > VMA had too much partial map, we disabled mTHP on this VMA.
>
> I actually had a whacky idea around introducing selectable page size ABI
> per-process that might help here. I know Android is doing work to make the
> system 16K page compatible. You could run most of the system processes with 16K
> ABI on top of 4K kernel. Then those processes don't even have the ability to
> madvise/munmap/mprotect/mremap anything less than 16K alignment so that acts as
> an anti-fragmentation mechanism while allowing non-16K capable processes to run
> side-by-side. Just a passing thought...

Right, this project faces a challenge in supporting legacy 4KiB-aligned
applications, but I don't see an issue in running 16KiB-aligned
applications on a kernel whose page size is 4KiB.

>
> >
> >>
> >>> reclamation will always cause latency and affect user
> >>> experience. split_folio is not cheap :-)
> >>
> >> But neither is memcpy(4K) I'd imagine. But I get your point.
> >
> > In a real product scenario, we need to consider the success rate of
> > allocating large folios. Currently, it's only 7%, as reported here[1],
> > with no method to keep large folios intact in a buddy system.
>
> Yes I saw that email - interesting. Is that 7% number for the Oppo
> implementation or upstream implementation? (I think Oppo?). Do you know how the
> other one compares (my guess is that upstream isn't complete enough yet to give
> viable numbers)? And I'm guessing you are running on a kernel/fs that doesn't
> support large folios in the page cache? What about large folio swap? My feeling
> is that once we have all these features, that number should significantly
> increase because you can create a system that essentially uses large quantities
> of a couple of sizes of page (e.g. 4K & (16K | 64K)) and fragmentation will be
> less of a problem. Perhaps that's wishful thinking though.

This is the number for OPPO's implementation, which supports only one
large folio size - 64KiB.
Meanwhile, OPPO has a TAO-like optimization that extends migrate_type and
marks some pageblocks dedicated to large folios (except for some corner
cases, order-3 allocations can also use them). It brings the success rate
to around 50% in do_anon_page and more than 90% in do_swap_page (we give
the latter a lower watermark as we save large objects in zsmalloc/zram -
compressing and decompressing 64KiB as a whole instead of doing 16 * 4KiB).

The reported data is with the TAO-like optimization disabled, just using
the buddy allocator.

BTW, based on previous observation, 16KiB allocation could still be a
problem on phones; for example, kernel stack allocation was a pain before
it was changed to vmalloc.

>
> >
> > Yu's TAO[2] chose to release the large folio entirely after copying
> > the mapped parts onto smaller folios in vmscan,
>
> Yes, TAO looks very interesting! It essentially partitions the memory IIUC?

Kind of - it adds two virtual zones to reduce compaction and keep large
folios intact rather than split.

>
> >
> > [1] https://lore.kernel.org/linux-mm/20240305083743.24950-1-21cnbao@gmail.com/
> > [2] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> >
> >>
> >>> if the number of this kind of large folios
> >>> is huge, the waste can be huge for a while.
> >>>
> >>> it is not a corner case for large folio swap-in: while someone writes
> >>> one subpage, I swap in a large folio, and wp_reuse will immediately
> >>> be called. This can cause waste quite often. One outcome of this
> >>> discussion is that I realize I should investigate this issue immediately
> >>> in the swap-in series, as my off-tree code has optimized reuse but
> >>> mainline hasn't.
> >>>
> >>>>
> >>>> At least for 2 MiB THP, it might make sense to make that large folio
> >>>> available immediately again, even without memory pressure. Even
> >>>> compaction would not compact it.
> >>>
> >>> It is also true for 64KiB, as we want other processes to allocate
> >>> 64KiB successfully as much as possible, and to reduce the rate of
> >>> falling back to small folios. By releasing 64KiB directly to the buddy
> >>> rather than splitting it and returning 15 * 4KiB in the shrinker, we
> >>> reduce buddy fragmentation too.
> >>>
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb

Thanks
Barry
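PS: for anyone who wants to poke at the two paths discussed above, an
untested userspace sketch along these lines should exercise them (it
assumes a kernel with 64KiB mTHP enabled under
/sys/kernel/mm/transparent_hugepage/; the manual alignment is only
illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define FOLIO_SZ (64 * 1024)

int main(void)
{
        /* over-allocate so we can carve out a 64KiB-aligned window */
        char *raw = mmap(NULL, 2 * FOLIO_SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        char *buf = (char *)(((unsigned long)raw + FOLIO_SZ - 1) &
                             ~(unsigned long)(FOLIO_SZ - 1));

        /* fault in all 64KiB so it can be backed by a single large folio */
        memset(buf, 1, FOLIO_SZ);

        /* covers the entire folio: the patch should avoid splitting here */
        if (madvise(buf, FOLIO_SZ, MADV_FREE))
                perror("MADV_FREE full");

        memset(buf, 2, FOLIO_SZ);

        /* covers only part of the folio: this has to take the split path */
        if (madvise(buf, FOLIO_SZ / 4, MADV_FREE))
                perror("MADV_FREE partial");

        return 0;
}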