Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

From: Barry Song <21cnbao@gmail.com>
To: Lance Yang <ioworker0@gmail.com>
Cc: akpm@linux-foundation.org, zokeefe@google.com,
	ryan.roberts@arm.com,  shy828301@gmail.com, david@redhat.com,
	mhocko@suse.com, fengwei.yin@intel.com,  xiehuan09@gmail.com,
	wangkefeng.wang@huawei.com, songmuchun@bytedance.com,
	 peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
Date: Thu, 7 Mar 2024 20:00:37 +1300
Message-ID: <CAGsJ_4xcRvZGdpPh1qcFTnTnDUbwz6WreQ=L_UO+oU2iFm9EPg@mail.gmail.com>
In-Reply-To: <20240307061425.21013-1-ioworker0@gmail.com>

On Thu, Mar 7, 2024 at 7:15 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> This patch optimizes lazyfreeing with PTE-mapped mTHP[1]
> (Inspired by David Hildenbrand[2]). We aim to avoid unnecessary
> folio splitting if the large folio is entirely within the given
> range.
>
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size |   Old    |   New    | Change
> ------------------------------------------
>       4KiB | 0.590251 | 0.590259 |    0%
>      16KiB | 2.990447 | 0.185655 |  -94%
>      32KiB | 2.547831 | 0.104870 |  -95%
>      64KiB | 2.457796 | 0.052812 |  -97%
>     128KiB | 2.281034 | 0.032777 |  -99%
>     256KiB | 2.230387 | 0.017496 |  -99%
>     512KiB | 2.189106 | 0.010781 |  -99%
>    1024KiB | 2.183949 | 0.007753 |  -99%
>    2048KiB | 0.002799 | 0.002804 |    0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
> v1 -> v2:
>  * Update the performance numbers
>  * Update the changelog, suggested by Ryan Roberts
>  * Check the COW folio, suggested by Yin Fengwei
>  * Check if we are mapping all subpages, suggested by Barry Song,
>  David Hildenbrand, Ryan Roberts
>  * https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@gmail.com/
>
>  mm/madvise.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 74 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 44a498c94158..1437ac6eb25e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -616,6 +616,20 @@ static long madvise_pageout(struct vm_area_struct *vma,
>         return 0;
>  }
>
> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> +                                                struct folio *folio, pte_t *start_pte)
> +{
> +       int nr_pages = folio_nr_pages(folio);
> +       fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +       for (int i = 0; i < nr_pages; i++)
> +               if (page_mapcount(folio_page(folio, i)) != 1)
> +                       return false;

We have moved to folio_estimated_sharers(), even though it is not
precise, so we no longer do this kind of check, which loops over every
subpage and depends on each subpage's mapcount.

BTW, do we need to rebase our work against David's changes[1]?

[1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@redhat.com/
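
To illustrate the trade-off between the two checks, here is a minimal
userspace sketch. It is a toy model, not kernel code: struct toy_folio
and both helpers are hypothetical names, and it assumes the estimate
samples only the first page's mapcount, which is why it is cheap but
imprecise.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PAGES 512

/* Toy stand-in for a large folio: only per-page mapcounts are modelled. */
struct toy_folio {
	int nr_pages;
	int mapcount[TOY_MAX_PAGES];
};

/*
 * Cheap estimate: look at the first page only.
 * Assumption: this mirrors the spirit of folio_estimated_sharers().
 */
static int toy_estimated_sharers(const struct toy_folio *folio)
{
	assert(folio->nr_pages > 0);
	return folio->mapcount[0];
}

/* Precise check: O(nr_pages) loop, like the one in the quoted patch. */
static bool toy_all_pages_mapped_once(const struct toy_folio *folio)
{
	for (int i = 0; i < folio->nr_pages; i++)
		if (folio->mapcount[i] != 1)
			return false;
	return true;
}
```

If page 5 of a 16-page folio is mapped twice while page 0 is mapped
once, the estimate still reports a single sharer, whereas the loop
notices the extra mapping at the cost of visiting every page.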

> +
> +       return nr_pages == folio_pte_batch(folio, addr, start_pte,
> +                                        ptep_get(start_pte), nr_pages, flags, NULL);
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                                 unsigned long end, struct mm_walk *walk)
>
> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                  */
>                 if (folio_test_large(folio)) {
>                         int err;
> +                       unsigned long next_addr, align;
>
> -                       if (folio_estimated_sharers(folio) != 1)
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> +                       if (folio_estimated_sharers(folio) != 1 ||
> +                           !folio_trylock(folio))
> +                               goto skip_large_folio;


I don't think we can skip all the PTEs for nr_pages, as some of them
might be pointing to other folios.

For example, take a large folio mapped by 16 PTEs. If you do
MADV_DONTNEED on PTE15 and PTE16 and then write to that memory, you get
page faults, so PTE15 and PTE16 will point to two different small
folios. We can only skip when we are sure that
nr_pages == folio_pte_batch().
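
The 16-PTE example can be sketched in userspace as follows. This is a
toy model under stated assumptions, not the kernel implementation:
struct toy_pte and the toy_* helpers are hypothetical, and
toy_pte_batch() only imitates the idea behind folio_pte_batch()
(counting consecutive PTEs that map consecutive pages of one folio),
with none of its flag handling.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a PTE: which folio it maps, and which page of it. */
struct toy_pte {
	int folio_id;	/* hypothetical folio identifier */
	int page_idx;	/* index of the mapped page inside that folio */
};

/* Map pages 0..nr-1 of one large folio with nr consecutive PTEs. */
static void toy_map_large_folio(struct toy_pte *ptes, int nr, int folio_id)
{
	for (int i = 0; i < nr; i++) {
		ptes[i].folio_id = folio_id;
		ptes[i].page_idx = i;
	}
}

/*
 * Model MADV_DONTNEED followed by a write fault on a single PTE:
 * the entry now maps page 0 of a freshly allocated small folio.
 */
static void toy_refault_small(struct toy_pte *pte, int new_folio_id)
{
	pte->folio_id = new_folio_id;
	pte->page_idx = 0;
}

/*
 * Count how many PTEs, starting at ptes[0], map consecutive pages of
 * the same folio. Stops at the first entry that belongs elsewhere.
 */
static int toy_pte_batch(const struct toy_pte *ptes, int max_nr)
{
	int nr = 1;

	assert(ptes != NULL && max_nr > 0);
	while (nr < max_nr &&
	       ptes[nr].folio_id == ptes[0].folio_id &&
	       ptes[nr].page_idx == ptes[0].page_idx + nr)
		nr++;
	return nr;
}
```

With all 16 entries intact the batch equals nr_pages, so skipping
nr_pages entries is safe. Once the last two entries have been refaulted
as small folios, the batch is shorter than nr_pages, and advancing by
nr_pages would step over PTEs that belong to other folios.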

> +
> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> +                       next_addr = ALIGN_DOWN(addr + align, align);
> +
> +                       /*
> +                        * If we mark only the subpages as lazyfree, or
> +                        * cannot mark the entire large folio as lazyfree,
> +                        * then just split it.
> +                        */
> +                       if (next_addr > end || next_addr - addr != align ||
> +                           !can_mark_large_folio_lazyfree(addr, folio, pte))
> +                               goto split_large_folio;
> +
> +                       /*
> +                        * Avoid unnecessary folio splitting if the large
> +                        * folio is entirely within the given range.
> +                        */
> +                       folio_clear_dirty(folio);
> +                       folio_unlock(folio);
> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +                               ptent = ptep_get(pte);
> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> +                                       ptent = ptep_get_and_clear_full(
> +                                               mm, addr, pte, tlb->fullmm);
> +                                       ptent = pte_mkold(ptent);
> +                                       ptent = pte_mkclean(ptent);
> +                                       set_pte_at(mm, addr, pte, ptent);
> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                               }

Can we do this in batches? For a CONT-PTE mapped large folio, this loop
unfolds the contiguous mapping and then folds it again, which seems
quite expensive.
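
As a rough illustration of what batching buys, here is a userspace
sketch. Everything in it is hypothetical: struct toy_flag_pte, the
toy_arch_calls counter and both helpers are made up for this example,
and the counter merely stands in for per-call architecture work such as
unfolding a contiguous mapping. It does not model real CONT-PTE
behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE carrying only the two bits this loop cares about. */
struct toy_flag_pte {
	bool young;
	bool dirty;
};

/* Number of calls made into the pretend architecture layer. */
static int toy_arch_calls;

/* Per-entry variant: one architecture call for every PTE touched. */
static void toy_mkold_clean_one(struct toy_flag_pte *pte)
{
	toy_arch_calls++;
	pte->young = false;
	pte->dirty = false;
}

/* Batched variant: one architecture call covers the whole range. */
static void toy_mkold_clean_batch(struct toy_flag_pte *ptes, int nr)
{
	assert(nr > 0);
	toy_arch_calls++;
	for (int i = 0; i < nr; i++) {
		ptes[i].young = false;
		ptes[i].dirty = false;
	}
}
```

For a 16-entry range the per-entry loop makes 16 calls while the batched
helper makes one. If each call had to unfold and refold a contiguous
block, the batched form would pay that cost once per range instead of
once per PTE.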

> +                       }
> +                       folio_mark_lazyfree(folio);
> +                       goto next_folio;
> +
> +split_large_folio:
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +736,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> +
> +                       /*
> +                        * If the large folio is locked or cannot be split,
> +                        * we just skip it.
> +                        */
> +                       if (err) {
> +skip_large_folio:
> +                               if (next_addr >= end)
> +                                       break;
> +                               pte += (next_addr - addr) / PAGE_SIZE;
> +                               addr = next_addr;
> +                       }
> +
> +                       if (!start_pte) {
> +                               start_pte = pte = pte_offset_map_lock(
> +                                       mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                       }
> +
> +next_folio:
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
> --
> 2.33.1
>

Thanks
Barry
