From: Barry Song <21cnbao@gmail.com>
Date: Mon, 26 Feb 2024 17:00:43 +1300
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Lance Yang
Cc: akpm@linux-foundation.org, zokeefe@google.com, shy828301@gmail.com,
    david@redhat.com, mhocko@suse.com, ryan.roberts@arm.com,
    wangkefeng.wang@huawei.com, songmuchun@bytedance.com, peterx@redhat.com,
    minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20240225123215.86503-1-ioworker0@gmail.com>

Hi Lance,

On Mon, Feb 26, 2024 at 1:33 AM Lance Yang wrote:
>
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size
> (for example, 16KiB to 1024KiB[1]).
> It’s probably part of the preparation to support anonymous
> multi-size THP.
>
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
>
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.
>

We did something similar on MADV_PAGEOUT[1]

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size |    Old     |    New     | Change
> ----------------------------------------------
>       4KiB |  0.590251  |  0.590264  |     0%
>      16KiB |  2.990447  |  0.182167  |   -94%
>      32KiB |  2.547831  |  0.101622  |   -96%
>      64KiB |  2.457796  |  0.049726  |   -98%
>     128KiB |  2.281034  |  0.030109  |   -99%
>     256KiB |  2.230387  |  0.015838  |   -99%
>     512KiB |  2.189106  |  0.009149  |   -99%
>    1024KiB |  2.183949  |  0.006620  |   -99%
>    2048KiB |  0.002799  |  0.002795  |     0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang
> ---
>  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 */
>  		if (folio_test_large(folio)) {
>  			int err;
> +			unsigned long next_addr, align;
>
> -			if (folio_estimated_sharers(folio) != 1)
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> +			if (folio_estimated_sharers(folio) != 1 ||
> +			    !folio_trylock(folio))
> +				goto skip_large_folio;
> +
> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> +			next_addr = ALIGN_DOWN(addr + align, align);
> +
> +			/*
> +			 * If we mark only the subpages as lazyfree,
> +			 * split the large folio.
> +			 */
> +			if (next_addr > end || next_addr - addr != align)
> +				goto split_large_folio;
> +
> +			/*
> +			 * Avoid unnecessary folio splitting if the large
> +			 * folio is entirely within the given range.
> +			 */
> +			folio_test_clear_dirty(folio);
> +			folio_unlock(folio);
> +			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +				ptent = ptep_get(pte);
> +				if (pte_young(ptent) || pte_dirty(ptent)) {
> +					ptent = ptep_get_and_clear_full(
> +						mm, addr, pte, tlb->fullmm);
> +					ptent = pte_mkold(ptent);
> +					ptent = pte_mkclean(ptent);
> +					set_pte_at(mm, addr, pte, ptent);
> +					tlb_remove_tlb_entry(tlb, pte, addr);
> +				}

The code assumes that the large folio is entirely mapped by all PTEs in
the range. This is not always true, so this won't work in some cases:
some PTEs might map the large folio, while others might have been
unmapped or might map different folios.
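
To make the failure cases concrete, here is a rough user-space model. It is only an illustration, not kernel code: `struct model_pte`, `model_range_cont_mapped()` and `model_demo()` are made-up names, and a PTE is reduced to a present flag plus a PFN. It covers a fully mapped 4-page folio, one with a hole, and one where a PTE points at a different folio.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a PTE: present flag plus PFN. */
struct model_pte {
	bool present;
	unsigned long pfn;
};

/*
 * Return true only if all nr entries are present and map consecutive
 * PFNs starting at start_pfn, i.e. the folio is fully mapped here.
 */
bool model_range_cont_mapped(unsigned long start_pfn,
			     const struct model_pte *start_pte, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (!start_pte[i].present)
			return false;
		if (start_pte[i].pfn != start_pfn + i)
			return false;
	}
	return true;
}

/* Three layouts for a 4-page folio whose first PFN is 100. */
void model_demo(void)
{
	const struct model_pte full[4] = {
		{ true, 100 }, { true, 101 }, { true, 102 }, { true, 103 },
	};
	const struct model_pte hole[4] = {
		{ true, 100 }, { false, 0 }, { true, 102 }, { true, 103 },
	};
	const struct model_pte other[4] = {
		{ true, 100 }, { true, 101 }, { true, 500 }, { true, 103 },
	};

	/* Only the fully and contiguously mapped layout passes. */
	assert(model_range_cont_mapped(100, full, 4));
	assert(!model_range_cont_mapped(100, hole, 4));
	assert(!model_range_cont_mapped(100, other, 4));
}
```

In this model, only the first layout could safely take a batched path without splitting. The other two would need a fallback.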

So in MADV_PAGEOUT, we have a function to check that the folio is really
entirely mapped:

+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+					 pte_t *start_pte, unsigned long start_addr, int nr)
+{
+	int i;
+	pte_t pte_val;
+
+	for (i = 0; i < nr; i++) {
+		pte_val = ptep_get(start_pte + i);
+
+		if (pte_none(pte_val))
+			return false;
+
+		if (pte_pfn(pte_val) != (start_pfn + i))
+			return false;
+	}
+
+	return true;
+}

> +			}
> +			folio_mark_lazyfree(folio);
> +			goto next_folio;
> +
> +split_large_folio:
>  			folio_get(folio);
>  			arch_leave_lazy_mmu_mode();
>  			pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			err = split_folio(folio);
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			if (err)
> -				break;
> -			start_pte = pte =
> -				pte_offset_map_lock(mm, pmd, addr, &ptl);
> -			if (!start_pte)
> -				break;
> -			arch_enter_lazy_mmu_mode();
> +
> +			/*
> +			 * If the large folio is locked before madvise(MADV_FREE)
> +			 * or cannot be split, we just skip it.
> +			 */
> +			if (err) {
> +skip_large_folio:
> +				if (next_addr >= end)
> +					break;
> +				pte += (next_addr - addr) / PAGE_SIZE;
> +				addr = next_addr;
> +			}
> +
> +			if (!start_pte) {
> +				start_pte = pte = pte_offset_map_lock(
> +					mm, pmd, addr, &ptl);
> +				if (!start_pte)
> +					break;
> +				arch_enter_lazy_mmu_mode();
> +			}
> +
> +next_folio:
>  			pte--;
>  			addr -= PAGE_SIZE;
>  			continue;
> --
> 2.33.1
>

Thanks
Barry