From: Barry Song <21cnbao@gmail.com>
To: Yin Fengwei <fengwei.yin@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
Lance Yang <ioworker0@gmail.com>,
David Hildenbrand <david@redhat.com>,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
peterx@redhat.com, shy828301@gmail.com,
songmuchun@bytedance.com, wangkefeng.wang@huawei.com,
zokeefe@google.com
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
Date: Tue, 27 Feb 2024 22:01:00 +1300
Message-ID: <CAGsJ_4yqk6eU+rpym61TnSN_5c5=K+u1FPa5JP92WKxzfrkkcA@mail.gmail.com>
In-Reply-To: <05f2d04c-333d-4298-8c7a-d5adeac5df82@intel.com>
On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 15:54, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 2/27/24 15:21, Barry Song wrote:
> >>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 14:40, Barry Song wrote:
> >>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>>>> folios multiple times to reclaim the list.
> >>>>>>>>
> >>>>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>>>> in vmscan. as our real purpose is avoiding splitting entirely mapped
> >>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>>>> previous added folio.
> >>>>>>>
> >>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>>>
> >>>>>> in case we have 16 PTEs in a large folio.
> >>>>>> PTE0 present
> >>>>>> PTE1 present
> >>>>>> PTE2 present
> >>>>>> PTE3 none
> >>>>>> PTE4 present
> >>>>>> PTE5 none
> >>>>>> PTE6 present
> >>>>>> ....
> >>>>>> the current code is scanning PTE one by one.
> >>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>>>> only scan PTE15 and know it's fully mapped.
> >>>>
> >>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >>>> be mapping to a completely different folio with PTE0.
> >>>>
> >>>>>
> >>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>>>> become none.
> >>>>
> >>>> I don't understand why all 16PTEs become none as we set PTEs to none.
> >>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>>>
> >>>>>
> >>>>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>>>> after scan PTE15 and know it's fully mapped.
> >>>>
> >>>> our approach is calling pte_batch_pte while meeting the first pte, if
> >>>> pte_batch_pte = 16,
> >>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >>>
> >>> Let's compare two different implementation, for partial mapped large folio
> >>> with 8 PTEs as below,
> >>>
> >>> PTE0 present for large folio1
> >>> PTE1 present for large folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folio1
> >>> PTE5 present for large folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> If we don't split in madvise(depend on vmscan to split after adding
> >>> folio1), we will have
> >> Let me clarify something here:
> >>
> >> I prefer that we don't split large folio here. Instead, we unmap the
> >> large folio from this VMA range (I think you missed the unmap operation
> >> I mentioned).
> >
> > I don't understand why we unmap as this is a MADV_PAGEOUT not
> > an unmap. unmapping totally changes the semantics. Would you like
> > to show pseudo code?
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?
We can't unmap for MADV_FREE either. MADV_FREE applies to anonymous
VMAs. When a folio is marked lazyfree, we move the anon folio to the
file LRU. If somebody writes to the folio afterwards, we take it back;
if nobody writes to it before vmscan reaches it on the file LRU, we can
reclaim it by setting the PTEs to none. So we can't unmap a large folio
immediately when MADV_FREE is called. Immediate unmap is the behavior
of MADV_DONTNEED, not MADV_FREE.
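
To make the userspace-visible side of this concrete, here is a minimal
sketch. It is not part of the patch, and the helper name
lazyfree_rewrite_demo() is made up for illustration. It assumes Linux
with MADV_FREE support (4.5 or later); if the advice is rejected, the
sketch just continues. It only shows that the mapping survives
MADV_FREE and that a later write is kept. It does not observe the
LRU movement described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* Linux uapi value; fallback for old libc headers */
#endif

/*
 * Hypothetical demo helper: mark an anonymous region lazyfree, then
 * write to it again. Returns 0 if the rewritten bytes read back
 * intact, -1 on error.
 */
int lazyfree_rewrite_demo(size_t len)
{
	unsigned char *buf;
	size_t i;
	int ok = 1;

	assert(len > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xaa, len);		/* fault in and dirty the pages */

	/*
	 * MADV_FREE leaves the mapping in place; the pages only become
	 * candidates for reclaim. A failure (e.g. EINVAL on kernels
	 * without MADV_FREE) is ignored in this sketch.
	 */
	(void)madvise(buf, len, MADV_FREE);

	/* Writing again makes the kernel keep the touched pages. */
	memset(buf, 0x55, len);

	for (i = 0; i < len; i++)
		if (buf[i] != 0x55)
			ok = 0;

	munmap(buf, len);
	return ok ? 0 : -1;
}
```

Calling lazyfree_rewrite_demo(16 * 4096) should return 0. With
MADV_DONTNEED in place of MADV_FREE the second memset would still work,
but the old contents would already have been dropped at madvise() time
rather than lazily under memory pressure.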
>
> >
> > for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> > to replace PTEs which are present. I don't understand how an unmap
> > can be involved in this process.
> >
> >>
> >> The intention is trying best to avoid splitting the large folio. If
> >> the folio is only partially mapped to this VMA range, it's likely it
> >> will be reclaimed as whole large folio. Which brings benefit for lru
> >> and zone lock contention comparing to splitting large folio.
> >
> > which also brings negative side effects such as redundant I/O.
> > For example, if you have only one subpage left in a large folio,
> > pageout will still write nr_pages subpages into swap, then immediately
> > free them in swap.
> >
> >>
> >> The thing I am not sure is unmapping from specific VMA range is not
> >> available and whether it's worthy to add it.
> >
> > I think we might have the possibility to have some complex code to
> > add folio1, folio2, folio3, folio4 and folio5 in the above example into
> > reclaim_list while avoiding splitting folio1. but i really don't understand
> > how unmap will work.
> >
> >>
> >>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> >>> reclaim_list by doing a complex
> >>> game while scanning these 8 PTEs.
> >>>
> >>> if we split in madvise, they become:
> >>>
> >>> PTE0 present for large folioA - splitted from folio 1
> >>> PTE1 present for large folioB - splitted from folio 1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folioC - splitted from folio 1
> >>> PTE5 present for large folioD - splitted from folio 1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> we simply add the above 8 folios into reclaim_list one by one.
> >>>
> >>> I would vote for splitting for partial mapped large folio in madvise.
> >>>
> >
Thanks
Barry