From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20240307061425.21013-1-ioworker0@gmail.com>
In-Reply-To: <20240307061425.21013-1-ioworker0@gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 7 Mar 2024 20:00:37 +1300
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Lance Yang
Cc: akpm@linux-foundation.org, zokeefe@google.com, ryan.roberts@arm.com,
 shy828301@gmail.com, david@redhat.com, mhocko@suse.com,
 fengwei.yin@intel.com, xiehuan09@gmail.com, wangkefeng.wang@huawei.com,
 songmuchun@bytedance.com, peterx@redhat.com, minchan@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On Thu, Mar 7, 2024 at 7:15 PM Lance Yang wrote:
>
> This patch optimizes lazyfreeing with PTE-mapped mTHP[1]
> (Inspired by David Hildenbrand[2]). We aim to avoid unnecessary
> folio splitting if the large folio is entirely within the given
> range.
>
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size |   Old    |   New    | Change
> ------------------------------------------
>       4KiB | 0.590251 | 0.590259 |     0%
>      16KiB | 2.990447 | 0.185655 |   -94%
>      32KiB | 2.547831 | 0.104870 |   -95%
>      64KiB | 2.457796 | 0.052812 |   -97%
>     128KiB | 2.281034 | 0.032777 |   -99%
>     256KiB | 2.230387 | 0.017496 |   -99%
>     512KiB | 2.189106 | 0.010781 |   -99%
>    1024KiB | 2.183949 | 0.007753 |   -99%
>    2048KiB | 0.002799 | 0.002804 |     0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang
> ---
> v1 -> v2:
> * Update the performance numbers
> * Update the changelog, suggested by Ryan Roberts
> * Check the COW folio, suggested by Yin Fengwei
> * Check if we are mapping all subpages, suggested by Barry Song,
>   David Hildenbrand, Ryan Roberts
> * https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@gmail.com/
>
>  mm/madvise.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 74 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 44a498c94158..1437ac6eb25e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -616,6 +616,20 @@ static long madvise_pageout(struct vm_area_struct *vma,
>         return 0;
>  }
>
> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> +                                                struct folio *folio, pte_t *start_pte)
> +{
> +       int nr_pages = folio_nr_pages(folio);
> +       fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +       for (int i = 0; i < nr_pages; i++)
> +               if (page_mapcount(folio_page(folio, i)) != 1)
> +                       return false;

We have moved to folio_estimated_sharers(), even though it is not precise,
so we should not do this check by looping over every subpage and depending
on each subpage's mapcount.
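The trade-off in this comment can be sketched in plain userspace C. This is
only a toy model, not kernel code: struct toy_folio and both helpers are
hypothetical names, and the O(1) estimate assumes (as I understand the
helper at the time of this thread) that only the first subpage's mapcount
is sampled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a large folio: one mapcount per subpage (hypothetical). */
struct toy_folio {
	size_t nr_pages;
	const int *mapcount;	/* mapcount[i] belongs to subpage i */
};

/* Precise check, O(nr_pages): same shape as the loop in the quoted hunk. */
static bool toy_all_subpages_mapped_once(const struct toy_folio *f)
{
	for (size_t i = 0; i < f->nr_pages; i++)
		if (f->mapcount[i] != 1)
			return false;
	return true;
}

/* Estimate, O(1): samples the first subpage only (assumed behaviour). */
static int toy_estimated_sharers(const struct toy_folio *f)
{
	assert(f->nr_pages > 0);
	return f->mapcount[0];
}
```

With mapcounts {1, 1, 2, 1} the estimate reports a single sharer while the
precise loop reports otherwise. That imprecision is the price of dropping
the per-subpage loop.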
BTW, do we need to rebase our work against David's changes[1]?

[1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@redhat.com/

> +
> +       return nr_pages == folio_pte_batch(folio, addr, start_pte,
> +                                        ptep_get(start_pte), nr_pages, flags, NULL);
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                                 unsigned long end, struct mm_walk *walk)
>
> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                  */
>                 if (folio_test_large(folio)) {
>                         int err;
> +                       unsigned long next_addr, align;
>
> -                       if (folio_estimated_sharers(folio) != 1)
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> +                       if (folio_estimated_sharers(folio) != 1 ||
> +                           !folio_trylock(folio))
> +                               goto skip_large_folio;

I don't think we can skip all the PTEs for nr_pages, as some of them might
be pointing to other folios.

For example, take a large folio with 16 PTEs. If you do MADV_DONTNEED(15-16)
and then write to the memory behind PTE15 and PTE16, you get page faults, so
PTE15 and PTE16 will point to two different small folios. We can only skip
when we are sure that nr_pages == folio_pte_batch().

> +
> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> +                       next_addr = ALIGN_DOWN(addr + align, align);
> +
> +                       /*
> +                        * If we mark only the subpages as lazyfree, or
> +                        * cannot mark the entire large folio as lazyfree,
> +                        * then just split it.
> +                        */
> +                       if (next_addr > end || next_addr - addr != align ||
> +                           !can_mark_large_folio_lazyfree(addr, folio, pte))
> +                               goto split_large_folio;
> +
> +                       /*
> +                        * Avoid unnecessary folio splitting if the large
> +                        * folio is entirely within the given range.
> +                        */
> +                       folio_clear_dirty(folio);
> +                       folio_unlock(folio);
> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +                               ptent = ptep_get(pte);
> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> +                                       ptent = ptep_get_and_clear_full(
> +                                               mm, addr, pte, tlb->fullmm);
> +                                       ptent = pte_mkold(ptent);
> +                                       ptent = pte_mkclean(ptent);
> +                                       set_pte_at(mm, addr, pte, ptent);
> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                               }

Can we do this in batches? For a CONT-PTE mapped large folio, you are
unfolding and folding again. It seems quite expensive.

> +                       }
> +                       folio_mark_lazyfree(folio);
> +                       goto next_folio;
> +
> +split_large_folio:
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +736,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> +
> +                       /*
> +                        * If the large folio is locked or cannot be split,
> +                        * we just skip it.
> +                        */
> +                       if (err) {
> +skip_large_folio:
> +                               if (next_addr >= end)
> +                                       break;
> +                               pte += (next_addr - addr) / PAGE_SIZE;
> +                               addr = next_addr;
> +                       }
> +
> +                       if (!start_pte) {
> +                               start_pte = pte = pte_offset_map_lock(
> +                                       mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                       }
> +
> +next_folio:
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
> --
> 2.33.1
>

Thanks
Barry
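To make the 16-PTE example in the skip_large_folio comment concrete, here is
a rough userspace sketch. It is a toy model rather than kernel code: struct
toy_pte, toy_pte_batch() and toy_can_skip_whole_folio() are hypothetical
names, and the batch helper only loosely mimics what folio_pte_batch()
reports, with no flag handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE: which folio it maps and which subpage of that folio. */
struct toy_pte {
	int folio_id;
	size_t page_idx;
};

/*
 * Count how many consecutive PTEs, starting at pte[0], map consecutive
 * subpages of the same folio as pte[0].
 */
static size_t toy_pte_batch(const struct toy_pte *pte, size_t max_nr)
{
	size_t nr = 1;

	assert(max_nr > 0);
	while (nr < max_nr &&
	       pte[nr].folio_id == pte[0].folio_id &&
	       pte[nr].page_idx == pte[0].page_idx + nr)
		nr++;
	return nr;
}

/* Skipping nr_pages PTEs is only safe if they all belong to this folio. */
static bool toy_can_skip_whole_folio(const struct toy_pte *pte,
				     size_t nr_pages)
{
	return toy_pte_batch(pte, nr_pages) == nr_pages;
}
```

If all 16 entries map subpages 0..15 of one folio, the whole range can be
skipped. Once the last two entries have been refaulted onto two other small
folios, the batch stops at 14, so advancing by nr_pages would step over PTEs
that belong to different folios.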