From mboxrd@z Thu Jan  1 00:00:00 1970
From: Lance Yang <ioworker0@gmail.com>
Date: Thu, 7 Mar 2024 23:08:37 +0800
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: David Hildenbrand
Cc: Ryan Roberts, Barry Song <21cnbao@gmail.com>, Vishal Moola,
 akpm@linux-foundation.org, zokeefe@google.com, shy828301@gmail.com,
 mhocko@suse.com, fengwei.yin@intel.com, xiehuan09@gmail.com,
 wangkefeng.wang@huawei.com, songmuchun@bytedance.com, peterx@redhat.com,
 minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240307061425.21013-1-ioworker0@gmail.com>
 <03458c20-5544-411b-9b8d-b4600a9b802f@arm.com>
 <501c9f77-1459-467a-8619-78e86b46d300@arm.com>
 <8f84c7d6-982a-4933-a7a7-3f640df64991@redhat.com>
 <63801377-2648-4c3b-b534-3cc5835f5cf6@redhat.com>
In-Reply-To: <63801377-2648-4c3b-b534-3cc5835f5cf6@redhat.com>

Thanks a lot, David! Got it. I'll do my best.

Thanks,
Lance

On Thu, Mar 7, 2024 at 10:58 PM David Hildenbrand wrote:
>
> On 07.03.24 15:41, Lance Yang wrote:
> > Hey Barry, Ryan, David,
> >
> > Thanks a lot for taking the time to explain and provide suggestions!
> > I really appreciate your time!
> >
> > IIUC, here's what we need to do for v3:
> >
> > If folio_likely_mapped_shared() is true, or if we cannot acquire
> > the folio lock, we simply skip the batched PTEs. Then, we compare
> > the number of batched PTEs against folio_mapcount(). Finally, we
> > batch-update the access and dirty bits only.
> >
> > I'm not sure if I've understood correctly; could you please confirm?
> >
> > Thanks,
> > Lance
> >
> > On Thu, Mar 7, 2024 at 7:17 PM David Hildenbrand wrote:
> >>
> >> On 07.03.24 12:13, Ryan Roberts wrote:
> >>> On 07/03/2024 10:54, David Hildenbrand wrote:
> >>>> On 07.03.24 11:54, David Hildenbrand wrote:
> >>>>> On 07.03.24 11:50, Ryan Roberts wrote:
> >>>>>> On 07/03/2024 09:33, Barry Song wrote:
> >>>>>>> On Thu, Mar 7, 2024 at 10:07 PM Ryan Roberts wrote:
> >>>>>>>>
> >>>>>>>> On 07/03/2024 08:10, Barry Song wrote:
> >>>>>>>>> On Thu, Mar 7, 2024 at 9:00 PM Lance Yang wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hey Barry,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for taking the time to review!
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 7, 2024 at 3:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 7, 2024 at 7:15 PM Lance Yang wrote:
> >>>>>>>>>>>>
> >>>>>>>>>> [...]
> >>>>>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> >>>>>>>>>>>> +                                                 struct folio *folio,
> >>>>>>>>>>>> +                                                 pte_t *start_pte)
> >>>>>>>>>>>> +{
> >>>>>>>>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>>>>>>> +       fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       for (int i = 0; i < nr_pages; i++)
> >>>>>>>>>>>> +               if (page_mapcount(folio_page(folio, i)) != 1)
> >>>>>>>>>>>> +                       return false;
> >>>>>>>>>>>
> >>>>>>>>>>> we have moved to folio_estimated_sharers(); though it is not precise,
> >>>>>>>>>>> it lets us avoid this check, with its many loop iterations depending
> >>>>>>>>>>> on each subpage's mapcount.
> >>>>>>>>>>
> >>>>>>>>>> If we don't check the subpage's mapcount, and there is a cow folio
> >>>>>>>>>> associated with this folio, and the cow folio is smaller than this
> >>>>>>>>>> folio, should we still mark this folio as lazyfree?
> >>>>>>>>>
> >>>>>>>>> I agree, this is true. However, we've somehow accepted the fact that
> >>>>>>>>> folio_likely_mapped_shared() can result in false negatives or false
> >>>>>>>>> positives to balance the overhead. So I really don't know :-)
> >>>>>>>>>
> >>>>>>>>> Maybe David and Vishal can give some comments here.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> BTW, do we need to rebase our work against David's changes[1]?
> >>>>>>>>>>> [1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@redhat.com/
> >>>>>>>>>>
> >>>>>>>>>> Yes, we should rebase our work against David's changes.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +       return nr_pages == folio_pte_batch(folio, addr, start_pte,
> >>>>>>>>>>>> +                                          ptep_get(start_pte), nr_pages,
> >>>>>>>>>>>> +                                          flags, NULL);
> >>>>>>>>>>>> +}
> >>>>>>>>>>>> +
> >>>>>>>>>>>>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>                                   unsigned long end, struct mm_walk *walk)
> >>>>>>>>>>>>
> >>>>>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>>>>>>>>>                  */
> >>>>>>>>>>>>                 if (folio_test_large(folio)) {
> >>>>>>>>>>>>                         int err;
> >>>>>>>>>>>> +                       unsigned long next_addr, align;
> >>>>>>>>>>>>
> >>>>>>>>>>>> -                       if (folio_estimated_sharers(folio) != 1)
> >>>>>>>>>>>> -                               break;
> >>>>>>>>>>>> -                       if (!folio_trylock(folio))
> >>>>>>>>>>>> -                               break;
> >>>>>>>>>>>> +                       if (folio_estimated_sharers(folio) != 1 ||
> >>>>>>>>>>>> +                           !folio_trylock(folio))
> >>>>>>>>>>>> +                               goto skip_large_folio;
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of them
> >>>>>>>>>>> might be pointing to other folios.
> >>>>>>>>>>>
> >>>>>>>>>>> for example, for a large folio with 16 PTEs, you do MADV_DONTNEED(15-16)
> >>>>>>>>>>> and then write the memory of PTE15 and PTE16; you get page faults, thus
> >>>>>>>>>>> PTE15 and PTE16 will point to two different small folios. We can only
> >>>>>>>>>>> skip when we are sure that nr_pages == folio_pte_batch().
> >>>>>>>>>>
> >>>>>>>>>> Agreed. Thanks for pointing that out.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> >>>>>>>>>>>> +                       next_addr = ALIGN_DOWN(addr + align, align);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +                       /*
> >>>>>>>>>>>> +                        * If we mark only the subpages as lazyfree, or
> >>>>>>>>>>>> +                        * cannot mark the entire large folio as lazyfree,
> >>>>>>>>>>>> +                        * then just split it.
> >>>>>>>>>>>> +                        */
> >>>>>>>>>>>> +                       if (next_addr > end || next_addr - addr != align ||
> >>>>>>>>>>>> +                           !can_mark_large_folio_lazyfree(addr, folio, pte))
> >>>>>>>>>>>> +                               goto split_large_folio;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +                       /*
> >>>>>>>>>>>> +                        * Avoid unnecessary folio splitting if the large
> >>>>>>>>>>>> +                        * folio is entirely within the given range.
> >>>>>>>>>>>> +                        */
> >>>>>>>>>>>> +                       folio_clear_dirty(folio);
> >>>>>>>>>>>> +                       folio_unlock(folio);
> >>>>>>>>>>>> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> >>>>>>>>>>>> +                               ptent = ptep_get(pte);
> >>>>>>>>>>>> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> >>>>>>>>>>>> +                                       ptent = ptep_get_and_clear_full(
> >>>>>>>>>>>> +                                               mm, addr, pte, tlb->fullmm);
> >>>>>>>>>>>> +                                       ptent = pte_mkold(ptent);
> >>>>>>>>>>>> +                                       ptent = pte_mkclean(ptent);
> >>>>>>>>>>>> +                                       set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>>>>> +                               }
> >>>>>>>>>>>
> >>>>>>>>>>> Can we do this in batches? For a CONT-PTE mapped large folio, you are
> >>>>>>>>>>> unfolding and folding again. It seems quite expensive.
> >>>>>>>>
> >>>>>>>> I'm not convinced we should be doing this in batches. We want the initial
> >>>>>>>> folio_pte_batch() to be as loose as possible regarding permissions so
> >>>>>>>> that we reduce our chances of splitting folios to the minimum (e.g.
> >>>>>>>> ignore SW bits like soft dirty, etc).
> >>>>>>>> I think it might be possible that some PTEs are RO and others are
> >>>>>>>> RW too (e.g. due to cow - although with the current cow impl, probably
> >>>>>>>> not. But it's fragile to assume that). Anyway, if we do an initial batch
> >>>>>>>> that ignores all
> >>>>>>>
> >>>>>>> You are correct. I believe this scenario could indeed occur. For instance,
> >>>>>>> if process A forks process B and then unmaps itself, leaving B as the
> >>>>>>> sole process owning the large folio. The current wp_page_reuse() function
> >>>>>>> will reuse PTEs one by one as each specific subpage is written.
> >>>>>>
> >>>>>> Hmm - I thought it would only reuse if the total mapcount for the folio
> >>>>>> was 1. And since it is a large folio with each page mapped once in proc B,
> >>>>>> I thought every subpage write would cause a copy except the last one? I
> >>>>>> haven't looked at the code for a while. But I had it in my head that this
> >>>>>> is an area we need to improve for mTHP.
> >>>>>
> >>>>> wp_page_reuse() will currently reuse a PTE part of a large folio only if
> >>>>> a single PTE remains mapped (refcount == 0).
> >>>>
> >>>> ^ == 1
> >>>
> >>> Ahh yes. That's what I meant. I got the behaviour vaguely right, though.
> >>>
> >>> Anyway, regardless, I'm not sure we want to batch here. Or if we do, we
> >>> want a batch function that will only clear access and dirty.
> >>
> >> We likely want to detect a folio batch the "usual" way (as relaxed as
> >> possible), then do all the checks (#pte == folio_mapcount() under the
> >> folio lock), and finally batch-update the access and dirty bits only.
>
> Something like:
>
> 1) We might want to factor out the existing single-pte case and extend
> it to handle multiple PTEs (nr_pages). For the existing code, we would
> pass in "nr_pages".
>
> For example, instead of "folio_mapcount(folio) != 1" you'd check
> "folio_mapcount(folio) != nr_pages" in there. And we'd need functions to
> abstract working on multiple ptes.
>
> 2) We'd add something like wrprotect_ptes(), that does the mkold+clean
> on multiple PTEs.
>
> Naming suggestion for such a function requested :)
>
> 3) Then, we might want to extend folio_pte_batch() by an *any_young and
> *any_dirty parameter that will get optimized out for other users, so you
> get that information right when scanning.
>
> Just a rough idea; the devil is in the detail. But trying to abstract
> the code to handle "multiple pages of the same folio" should come just
> naturally, like it did for fork() and munmap().
>
> --
> Cheers,
>
> David / dhildenb
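
To make 3) concrete, below is a rough sketch of folio_pte_batch() grown
by optional *any_young/*any_dirty out-parameters. This is an illustration
of the idea, not a tested patch: it assumes the __pte_batch_clear_ignored()
and pte_next_pfn() helpers the existing batching code in mm/memory.c is
built on, and it elides the existing any_writable handling.

static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
		bool *any_writable, bool *any_young, bool *any_dirty)
{
	pte_t expected_pte, *ptep = start_ptep + 1;
	int nr = 1;

	/* The first PTE seeds the accumulated access/dirty state. */
	if (any_young)
		*any_young = pte_young(pte);
	if (any_dirty)
		*any_dirty = pte_dirty(pte);

	expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
	for (; nr < max_nr; nr++, ptep++) {
		pte = ptep_get(ptep);

		/* Compare with the ignored bits stripped, as today. */
		if (!pte_same(__pte_batch_clear_ignored(pte, flags),
			      expected_pte))
			break;

		/* Only PTEs that joined the batch contribute their bits. */
		if (any_young)
			*any_young |= pte_young(pte);
		if (any_dirty)
			*any_dirty |= pte_dirty(pte);

		expected_pte = pte_next_pfn(expected_pte);
	}
	return nr;
}

Because the batch would be detected with FPB_IGNORE_DIRTY, a dirty subpage
no longer terminates the run; the caller instead learns about it through
*any_dirty in the very same scan.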
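For 2), a generic fallback might look like the sketch below. The name
mkold_clean_ptes() is only a placeholder (naming suggestions were
explicitly requested above); an architecture with contig-PTE support could
override it to batch the hardware update and avoid the unfold/fold cycle
Barry mentioned.

#ifndef mkold_clean_ptes
static inline void mkold_clean_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, unsigned int nr)
{
	pte_t pte;

	for (; nr-- > 0; ptep++, addr += PAGE_SIZE) {
		/*
		 * Clear and re-install the PTE so a concurrent hardware
		 * update of young/dirty cannot be lost between the read
		 * and the write.
		 */
		pte = ptep_get_and_clear(mm, addr, ptep);
		pte = pte_mkclean(pte_mkold(pte));
		set_pte_at(mm, addr, ptep, pte);
	}
}
#endif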
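Putting both together, the "detect the batch, check #pte ==
folio_mapcount() under the folio lock, then batch-update" flow could slot
into madvise_free_pte_range() roughly as follows. This is only a fragment
under the assumptions above, not the final patch; the split_folio/skip
labels stand in for the real function's error paths.

	if (folio_test_large(folio)) {
		bool any_young, any_dirty;
		int max_nr = (end - addr) >> PAGE_SHIFT;
		int nr;

		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
				     FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY,
				     NULL, &any_young, &any_dirty);

		/* The batch must cover the whole folio, else split it. */
		if (nr < folio_nr_pages(folio) || !folio_trylock(folio))
			goto split_folio;

		/* #pte == folio_mapcount(): no other mapping, e.g. no COW child. */
		if (folio_mapcount(folio) != nr) {
			folio_unlock(folio);
			goto skip;
		}

		folio_clear_dirty(folio);
		folio_unlock(folio);

		if (any_young || any_dirty)
			mkold_clean_ptes(mm, addr, pte, nr);
	}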