linux-mm.kvack.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: Yin Fengwei <fengwei.yin@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	Lance Yang <ioworker0@gmail.com>,
	 David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	 linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
	peterx@redhat.com,  shy828301@gmail.com,
	songmuchun@bytedance.com, wangkefeng.wang@huawei.com,
	 zokeefe@google.com
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
Date: Tue, 27 Feb 2024 22:01:00 +1300
Message-ID: <CAGsJ_4yqk6eU+rpym61TnSN_5c5=K+u1FPa5JP92WKxzfrkkcA@mail.gmail.com>
In-Reply-To: <05f2d04c-333d-4298-8c7a-d5adeac5df82@intel.com>

On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 15:54, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 2/27/24 15:21, Barry Song wrote:
> >>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 14:40, Barry Song wrote:
> >>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>>>> folios multiple times to reclaim the list.
> >>>>>>>>
> >>>>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> >>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>>>> previous added folio.
> >>>>>>>
> >>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>>>
> >>>>>> in case we have 16 PTEs in a large folio.
> >>>>>> PTE0 present
> >>>>>> PTE1 present
> >>>>>> PTE2 present
> >>>>>> PTE3  none
> >>>>>> PTE4 present
> >>>>>> PTE5 none
> >>>>>> PTE6 present
> >>>>>> ....
> >>>>>> the current code is scanning PTE one by one.
> >>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>>>> only scan PTE15 and know it's fully mapped.
> >>>>
> >>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >>>> be mapping to a completely different folio with PTE0.
> >>>>
> >>>>>
> >>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>>>> become none.
> >>>>
> >>>> I don't understand why all 16PTEs become none as we set PTEs to none.
> >>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>>>
> >>>>>
> >>>>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>>>> after scan PTE15 and know it's fully mapped.
> >>>>
> >>>> our approach is calling pte_batch_pte while meeting the first pte, if
> >>>> pte_batch_pte = 16, then we add this folio to reclaim_list and skip the
> >>>> left 15 PTEs.
> >>>
> >>> Let's compare two different implementation, for partial mapped large folio
> >>> with 8 PTEs as below,
> >>>
> >>> PTE0 present for large folio1
> >>> PTE1 present for large folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folio1
> >>> PTE5 present for large folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> If we don't split in madvise(depend on vmscan to split after adding
> >>> folio1), we will have
> >> Let me clarify something here:
> >>
> >> I prefer that we don't split large folio here. Instead, we unmap the
> >> large folio from this VMA range (I think you missed the unmap operation
> >> I mentioned).
> >
> > I don't understand why we unmap as this is a MADV_PAGEOUT not
> > an unmap. unmapping totally changes the semantics. Would you like
> > to show pseudo code?
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?

We can't unmap here either, because MADV_FREE applies to anon VMAs. When a
folio is marked lazyfree, we move the anon folio to the file LRU. If somebody
writes to the folio afterwards, we take the folio back; if nobody writes to it
before vmscan reaches it in the file LRU, we can reclaim it by setting the
PTEs to none. So we can't unmap a large folio immediately at the time
MADV_FREE is called. Immediate unmap is the behavior of MADV_DONTNEED,
not MADV_FREE.
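
To illustrate the difference, here is a rough userspace toy model of the
lazyfree lifecycle described above. It is not kernel code: every toy_* name
is hypothetical, and taking a written folio back at reclaim time is a
simplification made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: where a folio sits. */
enum toy_lru { TOY_LRU_ANON, TOY_LRU_FILE, TOY_LRU_NONE };

struct toy_anon_folio {
    bool mapped;      /* PTEs still present */
    bool dirty;       /* written since it was marked lazyfree */
    bool lazyfree;    /* marked by MADV_FREE */
    enum toy_lru lru;
};

static void toy_init(struct toy_anon_folio *f)
{
    f->mapped = true;
    f->dirty = true;
    f->lazyfree = false;
    f->lru = TOY_LRU_ANON;
}

/* MADV_FREE: mark lazyfree and move to the file LRU; PTEs stay present. */
static void toy_madv_free(struct toy_anon_folio *f)
{
    assert(f->mapped);
    f->dirty = false;
    f->lazyfree = true;
    f->lru = TOY_LRU_FILE;
}

/* A write after MADV_FREE makes the folio dirty again. */
static void toy_write(struct toy_anon_folio *f)
{
    assert(f->mapped);
    f->dirty = true;
}

/*
 * Reclaim visiting the folio on the file LRU. Returns true if it was freed.
 * A clean lazyfree folio is dropped by setting its PTEs to none; a dirty one
 * is taken back to the anon LRU (done here for simplicity).
 */
static bool toy_reclaim(struct toy_anon_folio *f)
{
    if (!f->mapped || !f->lazyfree)
        return false;
    if (f->dirty) {
        f->lazyfree = false;
        f->lru = TOY_LRU_ANON;
        return false;
    }
    f->mapped = false;
    f->lru = TOY_LRU_NONE;
    return true;
}

/* MADV_DONTNEED: the mapping goes away immediately, no reclaim involved. */
static void toy_madv_dontneed(struct toy_anon_folio *f)
{
    f->mapped = false;
    f->lazyfree = false;
    f->lru = TOY_LRU_NONE;
}
```

The point of the model is that toy_madv_free() leaves f->mapped set; only
toy_reclaim() on a still-clean folio clears it, whereas toy_madv_dontneed()
clears it right away.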

>
> >
> > for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> > to replace PTEs which are present. I don't understand how an unmap
> > can be involved in this process.
> >
> >>
> >> The intention is trying best to avoid splitting the large folio. If
> >> the folio is only partially mapped to this VMA range, it's likely it
> >> will be reclaimed as whole large folio. Which brings benefit for lru
> >> and zone lock contention comparing to splitting large folio.
> >
> > which also brings negative side effects such as redundant I/O.
> > For example, if you have only one subpage left in a large folio,
> > pageout will still write nr_pages subpages into swap, then immediately
> > free them in swap.
> >
> >>
> >> The thing I am not sure is unmapping from specific VMA range is not
> >> available and whether it's worthy to add it.
> >
> > I think we might have the possibility to have some complex code to
> > add folio1, folio2, folio3, folio4 and folio5 in the above example into
> > reclaim_list while avoiding splitting folio1. but i really don't understand
> > how unmap will work.
> >
> >>
> >>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> >>> reclaim_list by doing a complex game while scanning these 8 PTEs.
> >>>
> >>> if we split in madvise, they become:
> >>>
> >>> PTE0 present for folioA - split from folio1
> >>> PTE1 present for folioB - split from folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for folioC - split from folio1
> >>> PTE5 present for folioD - split from folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> we simply add the above 8 folios into reclaim_list one by one.
> >>>
> >>> I would vote for splitting for partial mapped large folio in madvise.
> >>>
> >
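
For reference, the "batch first, split if partial" scan from the 8-PTE
example can be sketched as a small userspace toy model. This is only a sketch
under stated assumptions: struct toy_pte, toy_pte_batch() and toy_scan() are
hypothetical names, not the kernel's helpers, and a "split" is modelled as
simply treating each remaining page of that folio as its own reclaim entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FOLIOS 16   /* arbitrary bound for this toy model */

/* Hypothetical, simplified PTE: which folio it maps, and which page of it. */
struct toy_pte {
    int folio_id;   /* < 0 models a none PTE */
    int page_idx;   /* index of the mapped page inside its folio */
};

struct toy_scan_result {
    int whole_folios;     /* large folios queued without splitting */
    int split_folios;     /* large folios split because partially mapped */
    int reclaim_entries;  /* total entries added to the reclaim list */
};

/* Count consecutive PTEs from 'start' mapping consecutive pages of one folio. */
static size_t toy_pte_batch(const struct toy_pte *ptes, size_t start, size_t n)
{
    size_t len = 1;

    assert(start < n && ptes[start].folio_id >= 0);
    while (start + len < n &&
           ptes[start + len].folio_id == ptes[start].folio_id &&
           ptes[start + len].page_idx == ptes[start].page_idx + (int)len)
        len++;
    return len;
}

/* nr_pages[id] is the size of folio 'id' before any split. */
static struct toy_scan_result toy_scan(const struct toy_pte *ptes, size_t n,
                                       const int *nr_pages)
{
    struct toy_scan_result res = {0, 0, 0};
    bool split[TOY_MAX_FOLIOS] = {false};
    size_t i = 0;

    while (i < n) {
        int id = ptes[i].folio_id;
        size_t batch;

        if (id < 0) {               /* none PTE: nothing to reclaim */
            i++;
            continue;
        }
        assert(id < TOY_MAX_FOLIOS);

        /* Small folio, or a page of an already split folio: one entry. */
        if (nr_pages[id] == 1 || split[id]) {
            res.reclaim_entries++;
            i++;
            continue;
        }

        batch = toy_pte_batch(ptes, i, n);
        if (ptes[i].page_idx == 0 && batch == (size_t)nr_pages[id]) {
            /* Entirely mapped here: add once, skip the remaining PTEs. */
            res.whole_folios++;
            res.reclaim_entries++;
            i += batch;
        } else {
            /* Partially mapped: split, then handle page by page. */
            split[id] = true;
            res.split_folios++;
            res.reclaim_entries++;
            i++;
        }
    }
    return res;
}
```

Fed the 8-PTE layout above (folio1 with 4 pages mapped at PTE0, PTE1, PTE4
and PTE5, plus four small folios), the model splits folio1 once and then adds
one entry per PTE, with no "was this folio already added?" check needed. A
fully mapped large folio takes the batch path and is added exactly once.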

Thanks
Barry


