From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57AA7C54E49 for ; Thu, 7 Mar 2024 11:27:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C24E96B0162; Thu, 7 Mar 2024 06:27:03 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BAD796B0163; Thu, 7 Mar 2024 06:27:03 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A74C06B0164; Thu, 7 Mar 2024 06:27:03 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 934D16B0162 for ; Thu, 7 Mar 2024 06:27:03 -0500 (EST) Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 53BD041169 for ; Thu, 7 Mar 2024 11:27:03 +0000 (UTC) X-FDA: 81870016326.18.C907426 Received: from mail-vk1-f176.google.com (mail-vk1-f176.google.com [209.85.221.176]) by imf30.hostedemail.com (Postfix) with ESMTP id 9A4D48000D for ; Thu, 7 Mar 2024 11:27:01 +0000 (UTC) Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Yu94lV5X; spf=pass (imf30.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.176 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709810821; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=; b=0Eq57zl6wNfewUg1itHtxPkNcphd5dpjZS/7oVBNupkp2NelHBs2ixJrUpVSbxTrb1m6MA FNnJ1ql8J8QjJR8JoUuEFSF67aYNo3Y386hXrVNCp4ar3vfapzDbeFp044IMiBLB+vVSSs wg+wqj2cNKpzRFDzOUOrTCDneJAo8Xo= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709810821; a=rsa-sha256; cv=none; b=YqZ/OZefEYIW1NFNzo2JWv6G47eUGBJct/Xn8PDu3d651Z5XPwhbAhRobw1s5x1St/CYjD 3zLtKKkWiiIIfYqNWK3dH3HknYK82KhaWTLHhM8xPWiBQCVOoE3TJZ0j0spgmi/0yCmK1Q 490wwvPAcivvSVh48mjngTTuKX4HBzE= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Yu94lV5X; spf=pass (imf30.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.176 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-vk1-f176.google.com with SMTP id 71dfb90a1353d-4d33e0c1b3eso89770e0c.2 for ; Thu, 07 Mar 2024 03:27:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1709810820; x=1710415620; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=; b=Yu94lV5XpdzT53zkM6GzP74wlXX79DTGFW+GDciQClDn+7+bpusR409b9xSpTpR9zc UIC3vsjr45RWJvLAd6u3FxuFEBbEwWCj8z1j3MEQJS5MksMAr75UMee+JIDHmAKXp6Uw 9DoRTeCB0tHRqkcq7rN+4FTClQLxwpDqwyqg9x5cwYO7vi6XetTzs6PDj3HoMSRTpvTR CmLUCB8F7dXHj4II5CI+37aZ42eMEzSSK+Di1Gdl/c2EWu9Nd6y1iaGyM/TKME0hfhEk Z5I8PVHlq6xTBvtutmjonE8qVzV8oTt+CpuKxOA27OmtTwf3J1LZDtKJhdd4UPVUL7FO lpGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709810820; x=1710415620; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=; b=SFa0t/caD1NBGcAALYIOuDnwITEmlYIx9zXcoBVYqJ0B4n5118x/QSJl91G4igvTCI FGLUXThdZGlkTd0EGRXe72V+GQ1xyHRqOpHpy4nrkV5VkBzhFdmkwQHgbYiel3VMBDFc bwb6vgL8LkjHrqVgr0ZbwkXql+BZl9+S3YL+GrwndAp1AW2HNQ5m9FXClLy3LLATrkiR x1dZflCZhg3HnEgUhL2IvmtD34fkR9SwM/j/7OwR9DU9YWvo50FZy1FL2RQm4JDHI+k5 oLxZDPziujZvY3lmnUQjxshiFADySPtKPuMaqyQJXOHpAHWeriR3cp24sZz2GHxPilvS 5Wmw== X-Forwarded-Encrypted: i=1; AJvYcCXkl9OWAR9Z7LIsU3hgNG7TbFlDSphEGZNBbYuj62AOOJfOAykgL8xgXivMs1o7N/ZCq0JnEVwEjfm+RS9Qa0ypWrA= X-Gm-Message-State: AOJu0Yxz3yFkOL3L4je7dJ0JXVldkr9vziybnrF493yUvTV5iSRRfHLe 0XyalmAvXz4myyRbKfk7lJ+tXKafOCXOhrgOilkb/oL9kh+4JhMZuLGvm3D/tir/6i5Gr72h56o 6u1glYNNlM8JIyP6vdqpNYgMR744= X-Google-Smtp-Source: AGHT+IE6IVoDMMSp9p1MaTPE0G9ltm+lL5xNYXrUIzXUdFX9G8XAi+8LYpoVHhbZ9LrjwLylG9cnl6Mx+I+iygk9dkQ= X-Received: by 2002:a05:6122:2908:b0:4d3:36b9:2c26 with SMTP id fm8-20020a056122290800b004d336b92c26mr7853739vkb.14.1709810820624; Thu, 07 Mar 2024 03:27:00 -0800 (PST) MIME-Version: 1.0 References: <20240307061425.21013-1-ioworker0@gmail.com> <03458c20-5544-411b-9b8d-b4600a9b802f@arm.com> <501c9f77-1459-467a-8619-78e86b46d300@arm.com> <8f84c7d6-982a-4933-a7a7-3f640df64991@redhat.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Thu, 7 Mar 2024 19:26:49 +0800 Message-ID: Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free To: Ryan Roberts Cc: David Hildenbrand , Lance Yang , Vishal Moola , akpm@linux-foundation.org, zokeefe@google.com, shy828301@gmail.com, mhocko@suse.com, fengwei.yin@intel.com, xiehuan09@gmail.com, wangkefeng.wang@huawei.com, songmuchun@bytedance.com, peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: w9ir4h995nha9zd7hszbf6od1n4cj8u8 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 9A4D48000D X-Rspam-User: X-HE-Tag: 1709810821-473943 X-HE-Meta: U2FsdGVkX1/y4gxtD3QrjfS477eOIbvLDZCxxJyIRmRZQjEnkHwRczldSgk/dU6cHHJMfdsfBviWsX0P+zIVVxoSTi0fkLL4odUQ20HEyCH6ZfRrtoqgCW406xouEgCGrA+Xo5xk0dE0TZaY/cWwbDeka/X+LuelY1wYnT1ykm86uyx9c9joDVq0nUsKR/SM7zdPHDE0HJX6VOsm1rau2lJgxQayAUMTVVtvS+99FyIdgxwzNjcPEhkgkr/sPUKmwLHAeg8YN7AQlKeGzHqOkcxPCgxXvlvY7cvCjqS8iRU4IjoKhqllfS8XkHIhL6XMS6lgpRNDJk4U8PD3ylbYDmIa2TDj8WgcyOmQ0nmRt/cTVlRc2wps/c+OC//nK8wMH5bXp+/GeCpJlptOsSeRRFk7D6PUQtSMlS4C9ub2fzytiK38tEfhxLrsEBVYu97uaQlWRVCtHs98YPAsQKzoOwpVqXNyg5ZgnicmSLYlVHPNGpsQGY5DhvI3lvjf07+mu/YeMqWpUUWF9NlYQi57o8IG4Vg4WIsAjWJUIcjMG0x9pj1DdeIRMH/e82HKtUORJZid+L8tenZ5urfUdaGbc7cUgjtKBhv+TWrHVxSQFxu3K1+piB66hcNyG01YBmVtzfX68Nw57xFjPJrsRI5rBnCfSr/RFyxh/kR6Bpsxq2n/VL3jXXYU0njBFkbKKc/SwjX1jazESyfi7PwNjuURV0a4fsLD73jffq6st5KUJUc7K05wBkqfXswJv8I7vsUhyvcyZxzmR9Z6YJhmDUiBGs2V1KUe7uY98ZWbNlbbdCvgcITm2LUKm9B1B54+aUbvNnfdckq3RsNh6AzDwBI1CtL3zv2yzSz7pL0Q9E5ApKF6UeHjYH64oTTkOK4xO3ux7bQV3tsP03TWuupm3kFOsNSerLb1r+XDZNqROcpZzDGXbiyHT2oOWVrsZsbzL6uVk4tAYB8aMJ8xSvTOyHO RIfOCms3 omCiIiQrEZFqFSjNVAUkN/A6lySCy00rNYvz9jaKYVEoquPvJr9xnOKmzgiQ0nmXRgKSRzqgoHVnBv/oGZLKXqQ3GaRe/MSxbE+a20pZy//Gduspi8fh1kmo4vk3JNLrBtCtT4Lnzq9V1/Zhs23nFNB2iuIurQTe8xrHT X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Mar 7, 2024 at 7:13=E2=80=AFPM Ryan Roberts = wrote: > > On 07/03/2024 10:54, David Hildenbrand wrote: > > On 07.03.24 11:54, David Hildenbrand wrote: > >> On 07.03.24 11:50, Ryan Roberts wrote: > >>> On 07/03/2024 09:33, Barry Song wrote: > >>>> On Thu, Mar 7, 2024 at 10:07=E2=80=AFPM Ryan Roberts wrote: > >>>>> > >>>>> On 07/03/2024 08:10, Barry Song wrote: > >>>>>> On Thu, Mar 7, 2024 at 9:00=E2=80=AFPM Lance Yang wrote: > >>>>>>> > >>>>>>> Hey Barry, > >>>>>>> > >>>>>>> Thanks for taking time to review! > >>>>>>> > >>>>>>> On Thu, Mar 7, 2024 at 3:00=E2=80=AFPM Barry Song <21cnbao@gmail.= com> wrote: > >>>>>>>> > >>>>>>>> On Thu, Mar 7, 2024 at 7:15=E2=80=AFPM Lance Yang wrote: > >>>>>>>>> > >>>>>>> [...] > >>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long= addr, > >>>>>>>>> + struct folio *= folio, > >>>>>>>>> pte_t *start_pte) > >>>>>>>>> +{ > >>>>>>>>> + int nr_pages =3D folio_nr_pages(folio); > >>>>>>>>> + fpb_t flags =3D FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRT= Y; > >>>>>>>>> + > >>>>>>>>> + for (int i =3D 0; i < nr_pages; i++) > >>>>>>>>> + if (page_mapcount(folio_page(folio, i)) !=3D 1) > >>>>>>>>> + return false; > >>>>>>>> > >>>>>>>> we have moved to folio_estimated_sharers though it is not precis= e, so > >>>>>>>> we don't do > >>>>>>>> this check with lots of loops and depending on the subpage's map= count. > >>>>>>> > >>>>>>> If we don't check the subpage=E2=80=99s mapcount, and there is a = cow folio > >>>>>>> associated > >>>>>>> with this folio and the cow folio has smaller size than this foli= o, > >>>>>>> should we still > >>>>>>> mark this folio as lazyfree? > >>>>>> > >>>>>> I agree, this is true. However, we've somehow accepted the fact th= at > >>>>>> folio_likely_mapped_shared > >>>>>> can result in false negatives or false positives to balance the > >>>>>> overhead. So I really don't know :-) > >>>>>> > >>>>>> Maybe David and Vishal can give some comments here. > >>>>>> > >>>>>>> > >>>>>>>> BTW, do we need to rebase our work against David's changes[1]? > >>>>>>>> [1] > >>>>>>>> https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@r= edhat.com/ > >>>>>>> > >>>>>>> Yes, we should rebase our work against David=E2=80=99s changes. > >>>>>>> > >>>>>>>> > >>>>>>>>> + > >>>>>>>>> + return nr_pages =3D=3D folio_pte_batch(folio, addr, sta= rt_pte, > >>>>>>>>> + ptep_get(start_pte), n= r_pages, > >>>>>>>>> flags, NULL); > >>>>>>>>> +} > >>>>>>>>> + > >>>>>>>>> static int madvise_free_pte_range(pmd_t *pmd, unsigned long = addr, > >>>>>>>>> unsigned long end, struct mm_= walk *walk) > >>>>>>>>> > >>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *= pmd, > >>>>>>>>> unsigned long addr, > >>>>>>>>> */ > >>>>>>>>> if (folio_test_large(folio)) { > >>>>>>>>> int err; > >>>>>>>>> + unsigned long next_addr, align; > >>>>>>>>> > >>>>>>>>> - if (folio_estimated_sharers(folio) !=3D= 1) > >>>>>>>>> - break; > >>>>>>>>> - if (!folio_trylock(folio)) > >>>>>>>>> - break; > >>>>>>>>> + if (folio_estimated_sharers(folio) !=3D= 1 || > >>>>>>>>> + !folio_trylock(folio)) > >>>>>>>>> + goto skip_large_folio; > >>>>>>>> > >>>>>>>> > >>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of = them > >>>>>>>> might be > >>>>>>>> pointing to other folios. > >>>>>>>> > >>>>>>>> for example, for a large folio with 16PTEs, you do MADV_DONTNEED= (15-16), > >>>>>>>> and write the memory of PTE15 and PTE16, you get page faults, th= us PTE15 > >>>>>>>> and PTE16 will point to two different small folios. We can only = skip > >>>>>>>> when we > >>>>>>>> are sure nr_pages =3D=3D folio_pte_batch() is sure. > >>>>>>> > >>>>>>> Agreed. Thanks for pointing that out. > >>>>>>> > >>>>>>>> > >>>>>>>>> + > >>>>>>>>> + align =3D folio_nr_pages(folio) * PAGE_= SIZE; > >>>>>>>>> + next_addr =3D ALIGN_DOWN(addr + align, = align); > >>>>>>>>> + > >>>>>>>>> + /* > >>>>>>>>> + * If we mark only the subpages as lazy= free, or > >>>>>>>>> + * cannot mark the entire large folio a= s lazyfree, > >>>>>>>>> + * then just split it. > >>>>>>>>> + */ > >>>>>>>>> + if (next_addr > end || next_addr - addr= !=3D > >>>>>>>>> align || > >>>>>>>>> + !can_mark_large_folio_lazyfree(addr= , folio, > >>>>>>>>> pte)) > >>>>>>>>> + goto split_large_folio; > >>>>>>>>> + > >>>>>>>>> + /* > >>>>>>>>> + * Avoid unnecessary folio splitting if= the large > >>>>>>>>> + * folio is entirely within the given r= ange. > >>>>>>>>> + */ > >>>>>>>>> + folio_clear_dirty(folio); > >>>>>>>>> + folio_unlock(folio); > >>>>>>>>> + for (; addr !=3D next_addr; pte++, addr= +=3D > >>>>>>>>> PAGE_SIZE) { > >>>>>>>>> + ptent =3D ptep_get(pte); > >>>>>>>>> + if (pte_young(ptent) || > >>>>>>>>> pte_dirty(ptent)) { > >>>>>>>>> + ptent =3D ptep_get_and_= clear_full( > >>>>>>>>> + mm, addr, pte, > >>>>>>>>> tlb->fullmm); > >>>>>>>>> + ptent =3D pte_mkold(pte= nt); > >>>>>>>>> + ptent =3D pte_mkclean(p= tent); > >>>>>>>>> + set_pte_at(mm, addr, pt= e, ptent); > >>>>>>>>> + tlb_remove_tlb_entry(tl= b, pte, > >>>>>>>>> addr); > >>>>>>>>> + } > >>>>>>>> > >>>>>>>> Can we do this in batches? for a CONT-PTE mapped large folio, yo= u are > >>>>>>>> unfolding > >>>>>>>> and folding again. It seems quite expensive. > >>>>> > >>>>> I'm not convinced we should be doing this in batches. We want the i= nitial > >>>>> folio_pte_batch() to be as loose as possible regarding permissions = so that we > >>>>> reduce our chances of splitting folios to the min. (e.g. ignore SW = bits like > >>>>> soft dirty, etc). I think it might be possible that some PTEs are R= O and other > >>>>> RW too (e.g. due to cow - although with the current cow impl, proba= bly not. > >>>>> But > >>>>> its fragile to assume that). Anyway, if we do an initial batch that= ignores > >>>>> all > >>>> > >>>> You are correct. I believe this scenario could indeed occur. For ins= tance, > >>>> if process A forks process B and then unmaps itself, leaving B as th= e > >>>> sole process owning the large folio. The current wp_page_reuse() fu= nction > >>>> will reuse PTE one by one while the specific subpage is written. > >>> > >>> Hmm - I thought it would only reuse if the total mapcount for the fol= io was 1. > >>> And since it is a large folio with each page mapped once in proc B, I= thought > >>> every subpage write would cause a copy except the last one? I haven't= looked at > >>> the code for a while. But I had it in my head that this is an area we= need to > >>> improve for mTHP. So sad I am wrong again =F0=9F=98=A2 > >> > >> wp_page_reuse() will currently reuse a PTE part of a large folio only = if > >> a single PTE remains mapped (refcount =3D=3D 0). > > > > ^ =3D=3D 1 seems this needs improvement. it is a waste the last subpage can reuse the whole large folio. i was doing it in a quite different way, if the large folio had only one subpage left, i would do copy and released the large folio[1]. and if i could reuse the whole large folio with CONT-PTE, i would reuse the whole large folio[2]. in mainline, we don't have this cont-pte luxury exposed to mm, so i guess we can not do [2] easily, but [1] seems to be an optimization. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplu= s/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3977 [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplu= s/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3812 > > Ahh yes. That's what I meant. I got the behacviour vagulely right though. > > Anyway, regardless, I'm not sure we want to batch here. Or if we do, we w= ant to > batch function that will only clear access and dirty. > Thanks Barry