linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
@ 2024-05-10  2:22 Barry Song
  2024-05-10  2:31 ` Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Barry Song @ 2024-05-10  2:22 UTC (permalink / raw)
  To: lsf-pc; +Cc: Linux-MM

Hi,

I'd like to propose a session about the allocation and reclamation of
mTHP. This is related to Yu Zhao's
TAO[1] but not the same.

OPPO has implemented mTHP-like large folios across thousands of
genuine Android devices, utilizing
ARM64 CONT-PTE. However, we've encountered challenges:

- The allocation of mTHP isn't consistently reliable; even after
prolonged use, obtaining large folios
  remains uncertain.
  As an instance, following a few hours of operation, the likelihood
of successfully allocating large
  folios on a phone may decrease to just 2%.

- Mixing large and small folios in the same LRU list can lead to
mutual blocking and unpredictable
  latency during reclamation/allocation.

  For instance, if you require large folios, the LRU list's tail could
be filled with small folios.
  LRU(LF- large folio, SF- small folio):

   LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF

 You might end up reclaiming many small folios yet still struggle to
allocate large folios. Conversely,
 the inverse scenario can occur when the LRU list's tail is populated
with large folios.

   SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF

In OPPO's products, we allocate dedicated pageblocks solely for large
folios allocation, and we've
fine-tuned the LRU mechanism to support dual LRU—one for small folios
and another for large ones.
Dedicated page blocks offer a fundamental guarantee of allocating
large folios. Additionally, segregating
small and large folios into two LRUs ensures that both can be
efficiently reclaimed for their respective
users' requests.  However, while the implementation may lack aesthetic
appeal and is primarily tailored
for product purposes, it isn't fully upstreamable.

You can obtain the architectural diagram of OPPO's approach from link[2].

Therefore, my plan is to present:

- Introduce the architecture of OPPO's mTHP-like approach, which
encompasses additional optimizations
  we've made to address swap fragmentation issues and improve swap
performance, such as dual-zRAM
  and compression/decompression of large folios [3].

- Present OPPO's method of utilizing dedicated page blocks and a
dual-LRU system for mTHP.

- Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.

- Discuss our future direction—are we leaning towards TAO or dedicated
page blocks? If we opt for page
  blocks, how do we plan to resolve the LRU issue?

[1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
[2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
[3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

Thanks,
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:22 [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation Barry Song
@ 2024-05-10  2:31 ` Matthew Wilcox
  2024-05-10  2:42   ` Barry Song
  2024-05-10 21:18 ` Yang Shi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2024-05-10  2:31 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM

On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
> I'd like to propose a session about the allocation and reclamation of
> mTHP. This is related to Yu Zhao's
> TAO[1] but not the same.

The important thing to understand about LSFMM is that it's NOT for
presentations.  It's for discussion.  So there's no need to have two
sessions on "this is our variant of this idea".  We have a session
scheduled for reliable allocation of large folios, and we can discuss
both implementations in that same slot.

Having slides that help us understand your problems / solutions are
useful.  Please present them.  But we don't need a separate session.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:31 ` Matthew Wilcox
@ 2024-05-10  2:42   ` Barry Song
  2024-05-10 14:25     ` [Lsf-pc] " Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-05-10  2:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, Linux-MM

On Fri, May 10, 2024 at 2:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
> > I'd like to propose a session about the allocation and reclamation of
> > mTHP. This is related to Yu Zhao's
> > TAO[1] but not the same.
>
> The important thing to understand about LSFMM is that it's NOT for
> presentations.  It's for discussion.  So there's no need to have two

I fully understand LSFMM is NOT for presentations but for discussion
while sending the proposal.

> sessions on "this is our variant of this idea".  We have a session
> scheduled for reliable allocation of large folios, and we can discuss
> both implementations in that same slot.

I'm completely open to discussing both topics in the same TAO session.
Having a separate session isn't important to me at all.

>
> Having slides that help us understand your problems / solutions are
> useful.  Please present them.  But we don't need a separate session.

That is exactly my purpose to present problems/solutions. Presenting
problems is quite important as we have to understand problems from
a more comprehensive perspective before taking action.

Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:42   ` Barry Song
@ 2024-05-10 14:25     ` Michal Hocko
  2024-05-10 20:33       ` Yu Zhao
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2024-05-10 14:25 UTC (permalink / raw)
  To: Barry Song; +Cc: Matthew Wilcox, Linux-MM, lsf-pc

On Fri 10-05-24 14:42:07, Barry Song wrote:
[...]
> I'm completely open to discussing both topics in the same TAO session.
> Having a separate session isn't important to me at all.

If we happen to still have some topics uncovered we can schedule a
follow up slot.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10 14:25     ` [Lsf-pc] " Michal Hocko
@ 2024-05-10 20:33       ` Yu Zhao
  2024-05-15  2:42         ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Yu Zhao @ 2024-05-10 20:33 UTC (permalink / raw)
  To: Michal Hocko, Barry Song; +Cc: Matthew Wilcox, Linux-MM, lsf-pc

[-- Attachment #1: Type: text/plain, Size: 853 bytes --]

On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 10-05-24 14:42:07, Barry Song wrote:
> [...]
> > I'm completely open to discussing both topics in the same TAO session.
> > Having a separate session isn't important to me at all.
>
> If we happen to still have some topics uncovered we can schedule a
> follow up slot.

Thanks, Michal.

Barry, you are very welcome to present your alternative approach in
the same session.

In fact, it seems that we both independently explored this
pageblock-based approach. I did share our design with a few folks and
explained why we put it on the back burner and have been focusing on
the zone-based approach since then.

Let me attach the deck that outlines our design, hopefully, we'll have
enough time to cover some of its ideas if there is enough interest.

[-- Attachment #2: DEFRAG.pdf --]
[-- Type: application/pdf, Size: 57714 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:22 [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation Barry Song
  2024-05-10  2:31 ` Matthew Wilcox
@ 2024-05-10 21:18 ` Yang Shi
  2024-05-14  9:20   ` Barry Song
  2024-05-15 23:41 ` Matthew Wilcox
  2024-05-22 21:43 ` David Hildenbrand
  3 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2024-05-10 21:18 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM

On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi,
>
> I'd like to propose a session about the allocation and reclamation of
> mTHP. This is related to Yu Zhao's
> TAO[1] but not the same.
>
> OPPO has implemented mTHP-like large folios across thousands of
> genuine Android devices, utilizing
> ARM64 CONT-PTE. However, we've encountered challenges:
>
> - The allocation of mTHP isn't consistently reliable; even after
> prolonged use, obtaining large folios
>   remains uncertain.
>   As an instance, following a few hours of operation, the likelihood
> of successfully allocating large
>   folios on a phone may decrease to just 2%.
>
> - Mixing large and small folios in the same LRU list can lead to
> mutual blocking and unpredictable
>   latency during reclamation/allocation.

I'm also curious how much large folios can improve reclamation
efficiency. Having large folios is supposed to reduce the scan time
since there should be fewer folios on LRU. But IIRC I haven't seen too
much data or benchmark (particularly real life workloads) regarding
this.

>
>   For instance, if you require large folios, the LRU list's tail could
> be filled with small folios.
>   LRU(LF- large folio, SF- small folio):
>
>    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
>
>  You might end up reclaiming many small folios yet still struggle to
> allocate large folios. Conversely,
>  the inverse scenario can occur when the LRU list's tail is populated
> with large folios.
>
>    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
>
> In OPPO's products, we allocate dedicated pageblocks solely for large
> folios allocation, and we've
> fine-tuned the LRU mechanism to support dual LRU—one for small folios
> and another for large ones.
> Dedicated page blocks offer a fundamental guarantee of allocating
> large folios. Additionally, segregating
> small and large folios into two LRUs ensures that both can be
> efficiently reclaimed for their respective
> users' requests.  However, while the implementation may lack aesthetic
> appeal and is primarily tailored
> for product purposes, it isn't fully upstreamable.
>
> You can obtain the architectural diagram of OPPO's approach from link[2].
>
> Therefore, my plan is to present:
>
> - Introduce the architecture of OPPO's mTHP-like approach, which
> encompasses additional optimizations
>   we've made to address swap fragmentation issues and improve swap
> performance, such as dual-zRAM
>   and compression/decompression of large folios [3].
>
> - Present OPPO's method of utilizing dedicated page blocks and a
> dual-LRU system for mTHP.
>
> - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
>
> - Discuss our future direction—are we leaning towards TAO or dedicated
> page blocks? If we opt for page
>   blocks, how do we plan to resolve the LRU issue?
>
> [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
>
> Thanks,
> Barry
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10 21:18 ` Yang Shi
@ 2024-05-14  9:20   ` Barry Song
  2024-05-15 13:49     ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-05-14  9:20 UTC (permalink / raw)
  To: Yang Shi; +Cc: lsf-pc, Linux-MM

On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > Hi,
> >
> > I'd like to propose a session about the allocation and reclamation of
> > mTHP. This is related to Yu Zhao's
> > TAO[1] but not the same.
> >
> > OPPO has implemented mTHP-like large folios across thousands of
> > genuine Android devices, utilizing
> > ARM64 CONT-PTE. However, we've encountered challenges:
> >
> > - The allocation of mTHP isn't consistently reliable; even after
> > prolonged use, obtaining large folios
> >   remains uncertain.
> >   As an instance, following a few hours of operation, the likelihood
> > of successfully allocating large
> >   folios on a phone may decrease to just 2%.
> >
> > - Mixing large and small folios in the same LRU list can lead to
> > mutual blocking and unpredictable
> >   latency during reclamation/allocation.
>
> I'm also curious how much large folios can improve reclamation
> efficiency. Having large folios is supposed to reduce the scan time
> since there should be fewer folios on LRU. But IIRC I haven't seen too
> much data or benchmark (particularly real life workloads) regarding
> this.

Hi Yang,

We lack direct data on this matter, but information from Ryan's THP_SWPOUT
series [1] provides insights as follows:

| alloc size |                baseline |           + this series |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |


I suspect the -13.6% performance decrease is due to the split
operation. Once the split
is eliminated, the patchset observed a 46.3% increase. It is presumed
that the overhead
required to reclaim 64K is reduced compared to reclaiming 16 * 4K.

However, at present, in actual android devices, we are observing
nearly 100% occurrence
of anon_thp_swpout_fallback after the device has been in operation for
several hours[2].

Hence, it is likely that we will experience regression instead of
improvement due to the
absence of measures to mitigate swap fragmentation.

[1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

>
> >
> >   For instance, if you require large folios, the LRU list's tail could
> > be filled with small folios.
> >   LRU(LF- large folio, SF- small folio):
> >
> >    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
> >
> >  You might end up reclaiming many small folios yet still struggle to
> > allocate large folios. Conversely,
> >  the inverse scenario can occur when the LRU list's tail is populated
> > with large folios.
> >
> >    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
> >
> > In OPPO's products, we allocate dedicated pageblocks solely for large
> > folios allocation, and we've
> > fine-tuned the LRU mechanism to support dual LRU—one for small folios
> > and another for large ones.
> > Dedicated page blocks offer a fundamental guarantee of allocating
> > large folios. Additionally, segregating
> > small and large folios into two LRUs ensures that both can be
> > efficiently reclaimed for their respective
> > users' requests.  However, while the implementation may lack aesthetic
> > appeal and is primarily tailored
> > for product purposes, it isn't fully upstreamable.
> >
> > You can obtain the architectural diagram of OPPO's approach from link[2].
> >
> > Therefore, my plan is to present:
> >
> > - Introduce the architecture of OPPO's mTHP-like approach, which
> > encompasses additional optimizations
> >   we've made to address swap fragmentation issues and improve swap
> > performance, such as dual-zRAM
> >   and compression/decompression of large folios [3].
> >
> > - Present OPPO's method of utilizing dedicated page blocks and a
> > dual-LRU system for mTHP.
> >
> > - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
> >
> > - Discuss our future direction—are we leaning towards TAO or dedicated
> > page blocks? If we opt for page
> >   blocks, how do we plan to resolve the LRU issue?
> >
> > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> >
Thanks,
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10 20:33       ` Yu Zhao
@ 2024-05-15  2:42         ` Barry Song
  2024-05-15 10:21           ` Karim Manaouil
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Barry Song @ 2024-05-15  2:42 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Michal Hocko, Matthew Wilcox, Linux-MM, lsf-pc

[-- Attachment #1: Type: text/plain, Size: 1416 bytes --]

On Sat, May 11, 2024 at 8:33 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 10-05-24 14:42:07, Barry Song wrote:
> > [...]
> > > I'm completely open to discussing both topics in the same TAO session.
> > > Having a separate session isn't important to me at all.
> >
> > If we happen to still have some topics uncovered we can schedule a
> > follow up slot.
>
> Thanks, Michal.
>
> Barry, you are very welcome to present your alternative approach in
> the same session.
>
> In fact, it seems that we both independently explored this
> pageblock-based approach. I did share our design with a few folks and
> explained why we put it on the back burner and have been focusing on
> the zone-based approach since then.
>
> Let me attach the deck that outlines our design, hopefully, we'll have
> enough time to cover some of its ideas if there is enough interest.

Thank you. I'm also attaching our findings regarding mTHP allocation &
reclamation fallback, along with our approach and observations on running
TAO on Pixel 6.

From deploying mTHP on numerous phones, I've learned that I'm not keen
on using dedicated page blocks for mTHP. Instead, I prefer the virtual zone
approach due to the folio size conflict in a single LRU. Further details are
available in the attached PDF.

Best regards
Barry

[-- Attachment #2: mTHP_allocation_reclamation.pdf --]
[-- Type: application/pdf, Size: 1215777 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15  2:42         ` Barry Song
@ 2024-05-15 10:21           ` Karim Manaouil
  2024-05-15 10:59           ` Yu Zhao
  2024-05-15 13:50           ` Yang Shi
  2 siblings, 0 replies; 22+ messages in thread
From: Karim Manaouil @ 2024-05-15 10:21 UTC (permalink / raw)
  To: Barry Song, Yu Zhao
  Cc: Michal Hocko, Karim Manaouil, Matthew Wilcox, Linux-MM, lsf-pc

On Wed, May 15, 2024 at 02:42:55PM +1200, Barry Song wrote:
> On Sat, May 11, 2024 at 8:33 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 10-05-24 14:42:07, Barry Song wrote:
> > > [...]
> > > > I'm completely open to discussing both topics in the same TAO session.
> > > > Having a separate session isn't important to me at all.
> > >
> > > If we happen to still have some topics uncovered we can schedule a
> > > follow up slot.
> >
> > Thanks, Michal.
> >
> > Barry, you are very welcome to present your alternative approach in
> > the same session.
> >
> > In fact, it seems that we both independently explored this
> > pageblock-based approach. I did share our design with a few folks and
> > explained why we put it on the back burner and have been focusing on
> > the zone-based approach since then.
> >
> > Let me attach the deck that outlines our design, hopefully, we'll have
> > enough time to cover some of its ideas if there is enough interest.
> 
> Thank you. I'm also attaching our findings regarding mTHP allocation &
> reclamation fallback, along with our approach and observations on running
> TAO on Pixel 6.
> 
> From deploying mTHP on numerous phones, I've learned that I'm not keen
> on using dedicated page blocks for mTHP. Instead, I prefer the virtual zone
> approach due to the folio size conflict in a single LRU. Further details are
> available in the attached PDF.

Could you, please, share the outcome of the discussion, from LSFMM,
here?

Best
Karim


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15  2:42         ` Barry Song
  2024-05-15 10:21           ` Karim Manaouil
@ 2024-05-15 10:59           ` Yu Zhao
  2024-05-15 13:50           ` Yang Shi
  2 siblings, 0 replies; 22+ messages in thread
From: Yu Zhao @ 2024-05-15 10:59 UTC (permalink / raw)
  To: Barry Song; +Cc: Michal Hocko, Matthew Wilcox, Linux-MM, lsf-pc

[-- Attachment #1: Type: text/plain, Size: 1625 bytes --]

On Tue, May 14, 2024 at 8:43 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 11, 2024 at 8:33 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 10-05-24 14:42:07, Barry Song wrote:
> > > [...]
> > > > I'm completely open to discussing both topics in the same TAO session.
> > > > Having a separate session isn't important to me at all.
> > >
> > > If we happen to still have some topics uncovered we can schedule a
> > > follow up slot.
> >
> > Thanks, Michal.
> >
> > Barry, you are very welcome to present your alternative approach in
> > the same session.
> >
> > In fact, it seems that we both independently explored this
> > pageblock-based approach. I did share our design with a few folks and
> > explained why we put it on the back burner and have been focusing on
> > the zone-based approach since then.
> >
> > Let me attach the deck that outlines our design, hopefully, we'll have
> > enough time to cover some of its ideas if there is enough interest.
>
> Thank you. I'm also attaching our findings regarding mTHP allocation &
> reclamation fallback, along with our approach and observations on running
> TAO on Pixel 6.
>
> From deploying mTHP on numerous phones, I've learned that I'm not keen
> on using dedicated page blocks for mTHP. Instead, I prefer the virtual zone
> approach due to the folio size conflict in a single LRU. Further details are
> available in the attached PDF.

It seems your PDF didn't go through -- it might be too big. Let me
compress it and try.

[-- Attachment #2: mTHP_allocation_reclamation.pdf --]
[-- Type: application/pdf, Size: 1215777 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-14  9:20   ` Barry Song
@ 2024-05-15 13:49     ` Yang Shi
  2024-05-15 19:25       ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2024-05-15 13:49 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM

On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I'd like to propose a session about the allocation and reclamation of
> > > mTHP. This is related to Yu Zhao's
> > > TAO[1] but not the same.
> > >
> > > OPPO has implemented mTHP-like large folios across thousands of
> > > genuine Android devices, utilizing
> > > ARM64 CONT-PTE. However, we've encountered challenges:
> > >
> > > - The allocation of mTHP isn't consistently reliable; even after
> > > prolonged use, obtaining large folios
> > >   remains uncertain.
> > >   As an instance, following a few hours of operation, the likelihood
> > > of successfully allocating large
> > >   folios on a phone may decrease to just 2%.
> > >
> > > - Mixing large and small folios in the same LRU list can lead to
> > > mutual blocking and unpredictable
> > >   latency during reclamation/allocation.
> >
> > I'm also curious how much large folios can improve reclamation
> > efficiency. Having large folios is supposed to reduce the scan time
> > since there should be fewer folios on LRU. But IIRC I haven't seen too
> > much data or benchmark (particularly real life workloads) regarding
> > this.
>
> Hi Yang,
>
> We lack direct data on this matter, but information from Ryan's THP_SWPOUT
> series [1] provides insights as follows:
>
> | alloc size |                baseline |           + this series |
> |            | mm-unstable (~v6.9-rc1) |                         |
> |:-----------|------------------------:|------------------------:|
> | 4K Page    |                    0.0% |                    1.3% |
> | 64K THP    |                  -13.6% |                   46.3% |
> | 2M THP     |                   91.4% |                   89.6% |
>
>
> I suspect the -13.6% performance decrease is due to the split
> operation. Once the split
> is eliminated, the patchset observed a 46.3% increase. It is presumed
> that the overhead
> required to reclaim 64K is reduced compared to reclaiming 16 * 4K.

Thank you. Actually I care about 4k vs 64k vs 256k ...

I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
swapout optimization then measured the time spent in madvise, I can
see the time was reduced by ~23% between 64k vs 4k. Then there is no
noticeable reduction between 64k and larger sizes.

Actually I saw such a pattern (performance doesn't scale with page
size after 64K) with some real life workload benchmark. I'm going to
talk about it in today's LSF/MM.

>
> However, at present, in actual android devices, we are observing
> nearly 100% occurrence
> of anon_thp_swpout_fallback after the device has been in operation for
> several hours[2].
>
> Hence, it is likely that we will experience regression instead of
> improvement due to the
> absence of measures to mitigate swap fragmentation.
>
> [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> >
> > >
> > >   For instance, if you require large folios, the LRU list's tail could
> > > be filled with small folios.
> > >   LRU(LF- large folio, SF- small folio):
> > >
> > >    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
> > >
> > >  You might end up reclaiming many small folios yet still struggle to
> > > allocate large folios. Conversely,
> > >  the inverse scenario can occur when the LRU list's tail is populated
> > > with large folios.
> > >
> > >    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
> > >
> > > In OPPO's products, we allocate dedicated pageblocks solely for large
> > > folios allocation, and we've
> > > fine-tuned the LRU mechanism to support dual LRU—one for small folios
> > > and another for large ones.
> > > Dedicated page blocks offer a fundamental guarantee of allocating
> > > large folios. Additionally, segregating
> > > small and large folios into two LRUs ensures that both can be
> > > efficiently reclaimed for their respective
> > > users' requests.  However, while the implementation may lack aesthetic
> > > appeal and is primarily tailored
> > > for product purposes, it isn't fully upstreamable.
> > >
> > > You can obtain the architectural diagram of OPPO's approach from link[2].
> > >
> > > Therefore, my plan is to present:
> > >
> > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > encompasses additional optimizations
> > >   we've made to address swap fragmentation issues and improve swap
> > > performance, such as dual-zRAM
> > >   and compression/decompression of large folios [3].
> > >
> > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > dual-LRU system for mTHP.
> > >
> > > - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
> > >
> > > - Discuss our future direction—are we leaning towards TAO or dedicated
> > > page blocks? If we opt for page
> > >   blocks, how do we plan to resolve the LRU issue?
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > >
> Thanks,
> Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15  2:42         ` Barry Song
  2024-05-15 10:21           ` Karim Manaouil
  2024-05-15 10:59           ` Yu Zhao
@ 2024-05-15 13:50           ` Yang Shi
  2024-05-15 18:14             ` Barry Song
  2 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2024-05-15 13:50 UTC (permalink / raw)
  To: Barry Song; +Cc: Yu Zhao, Michal Hocko, Matthew Wilcox, Linux-MM, lsf-pc

On Tue, May 14, 2024 at 8:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 11, 2024 at 8:33 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 10-05-24 14:42:07, Barry Song wrote:
> > > [...]
> > > > I'm completely open to discussing both topics in the same TAO session.
> > > > Having a separate session isn't important to me at all.
> > >
> > > If we happen to still have some topics uncovered we can schedule a
> > > follow up slot.
> >
> > Thanks, Michal.
> >
> > Barry, you are very welcome to present your alternative approach in
> > the same session.
> >
> > In fact, it seems that we both independently explored this
> > pageblock-based approach. I did share our design with a few folks and
> > explained why we put it on the back burner and have been focusing on
> > the zone-based approach since then.
> >
> > Let me attach the deck that outlines our design, hopefully, we'll have
> > enough time to cover some of its ideas if there is enough interest.
>
> Thank you. I'm also attaching our findings regarding mTHP allocation &
> reclamation fallback, along with our approach and observations on running
> TAO on Pixel 6.
>
> From deploying mTHP on numerous phones, I've learned that I'm not keen
> on using dedicated page blocks for mTHP. Instead, I prefer the virtual zone
> approach due to the folio size conflict in a single LRU. Further details are
> available in the attached PDF.

I'd like to know what page sizes were enabled for your test? A single
page size, for example, 64K, or all possible page sizes?

>
> Best regards
> Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15 13:50           ` Yang Shi
@ 2024-05-15 18:14             ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2024-05-15 18:14 UTC (permalink / raw)
  To: Yang Shi; +Cc: Yu Zhao, Michal Hocko, Matthew Wilcox, Linux-MM, lsf-pc

On Thu, May 16, 2024 at 1:50 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 14, 2024 at 8:45 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 11, 2024 at 8:33 AM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Fri, May 10, 2024 at 8:25 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 10-05-24 14:42:07, Barry Song wrote:
> > > > [...]
> > > > > I'm completely open to discussing both topics in the same TAO session.
> > > > > Having a separate session isn't important to me at all.
> > > >
> > > > If we happen to still have some topics uncovered we can schedule a
> > > > follow up slot.
> > >
> > > Thanks, Michal.
> > >
> > > Barry, you are very welcome to present your alternative approach in
> > > the same session.
> > >
> > > In fact, it seems that we both independently explored this
> > > pageblock-based approach. I did share our design with a few folks and
> > > explained why we put it on the back burner and have been focusing on
> > > the zone-based approach since then.
> > >
> > > Let me attach the deck that outlines our design, hopefully, we'll have
> > > enough time to cover some of its ideas if there is enough interest.
> >
> > Thank you. I'm also attaching our findings regarding mTHP allocation &
> > reclamation fallback, along with our approach and observations on running
> > TAO on Pixel 6.
> >
> > From deploying mTHP on numerous phones, I've learned that I'm not keen
> > on using dedicated page blocks for mTHP. Instead, I prefer the virtual zone
> > approach due to the folio size conflict in a single LRU. Further details are
> > available in the attached PDF.
>
> I'd like to know what page sizes were enabled for your test? A single
> page size, for example, 64K, or all possible page sizes?

A single page size - 64K. In my case, I doubt that enabling all
conceivable page sizes would improve performance, as user space may
have its own unique patterns for operations like MADV_DONTNEED. Since
much of the memory is likely allocated to the heap, enabling all sizes
could lead to numerous fragments within these heap VMAs.

Conversely, without a readahead-like mechanism, allocating large
folio sizes might result in portions of memory being seldom or never
used. Considering that 64K is one of the base page sizes for ARM hardware,
it strikes a balance between being adequately sized and not overly large.

>
> >
> > Best regards
> > Barry

Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15 13:49     ` Yang Shi
@ 2024-05-15 19:25       ` Barry Song
  2024-05-15 21:41         ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-05-15 19:25 UTC (permalink / raw)
  To: Yang Shi; +Cc: lsf-pc, Linux-MM

On Thu, May 16, 2024 at 1:49 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'd like to propose a session about the allocation and reclamation of
> > > > mTHP. This is related to Yu Zhao's
> > > > TAO[1] but not the same.
> > > >
> > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > genuine Android devices, utilizing
> > > > ARM64 CONT-PTE. However, we've encountered challenges:
> > > >
> > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > prolonged use, obtaining large folios
> > > >   remains uncertain.
> > > >   As an instance, following a few hours of operation, the likelihood
> > > > of successfully allocating large
> > > >   folios on a phone may decrease to just 2%.
> > > >
> > > > - Mixing large and small folios in the same LRU list can lead to
> > > > mutual blocking and unpredictable
> > > >   latency during reclamation/allocation.
> > >
> > > I'm also curious how much large folios can improve reclamation
> > > efficiency. Having large folios is supposed to reduce the scan time
> > > since there should be fewer folios on LRU. But IIRC I haven't seen too
> > > much data or benchmark (particularly real life workloads) regarding
> > > this.
> >
> > Hi Yang,
> >
> > We lack direct data on this matter, but information from Ryan's THP_SWPOUT
> > series [1] provides insights as follows:
> >
> > | alloc size |                baseline |           + this series |
> > |            | mm-unstable (~v6.9-rc1) |                         |
> > |:-----------|------------------------:|------------------------:|
> > | 4K Page    |                    0.0% |                    1.3% |
> > | 64K THP    |                  -13.6% |                   46.3% |
> > | 2M THP     |                   91.4% |                   89.6% |
> >
> >
> > I suspect the -13.6% performance decrease is due to the split
> > operation. Once the split
> > is eliminated, the patchset observed a 46.3% increase. It is presumed
> > that the overhead
> > required to reclaim 64K is reduced compared to reclaiming 16 * 4K.
>
> Thank you. Actually I care about 4k vs 64k vs 256k ...
>
> I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
> swapout optimization then measured the time spent in madvise, I can
> see the time was reduced by ~23% between 64k vs 4k. Then there is no
> noticeable reduction between 64k and larger sizes.

If you engage in perf analysis, what observations can you make? I suspect that
even with larger folios, the function try_to_unmap_one() continues to iterate
through PTEs individually.
If we're able to batch the unmapping process for the entire folio, we might
observe improved performance.

>
> Actually I saw such a pattern (performance doesn't scale with page
> size after 64K) with some real life workload benchmark. I'm going to
> talk about it in today's LSF/MM.
>
> >
> > However, at present, in actual android devices, we are observing
> > nearly 100% occurrence
> > of anon_thp_swpout_fallback after the device has been in operation for
> > several hours[2].
> >
> > Hence, it is likely that we will experience regression instead of
> > improvement due to the
> > absence of measures to mitigate swap fragmentation.
> >
> > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > >
> > > >
> > > >   For instance, if you require large folios, the LRU list's tail could
> > > > be filled with small folios.
> > > >   LRU(LF- large folio, SF- small folio):
> > > >
> > > >    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
> > > >
> > > >  You might end up reclaiming many small folios yet still struggle to
> > > > allocate large folios. Conversely,
> > > >  the inverse scenario can occur when the LRU list's tail is populated
> > > > with large folios.
> > > >
> > > >    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
> > > >
> > > > In OPPO's products, we allocate dedicated pageblocks solely for large
> > > > folios allocation, and we've
> > > > fine-tuned the LRU mechanism to support dual LRU—one for small folios
> > > > and another for large ones.
> > > > Dedicated page blocks offer a fundamental guarantee of allocating
> > > > large folios. Additionally, segregating
> > > > small and large folios into two LRUs ensures that both can be
> > > > efficiently reclaimed for their respective
> > > > users' requests.  However, while the implementation may lack aesthetic
> > > > appeal and is primarily tailored
> > > > for product purposes, it isn't fully upstreamable.
> > > >
> > > > You can obtain the architectural diagram of OPPO's approach from link[2].
> > > >
> > > > Therefore, my plan is to present:
> > > >
> > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > encompasses additional optimizations
> > > >   we've made to address swap fragmentation issues and improve swap
> > > > performance, such as dual-zRAM
> > > >   and compression/decompression of large folios [3].
> > > >
> > > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > > dual-LRU system for mTHP.
> > > >
> > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
> > > >
> > > > - Discuss our future direction—are we leaning towards TAO or dedicated
> > > > page blocks? If we opt for page
> > > >   blocks, how do we plan to resolve the LRU issue?
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > >

Thanks,
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15 19:25       ` Barry Song
@ 2024-05-15 21:41         ` Yang Shi
  2024-05-15 22:15           ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2024-05-15 21:41 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM

On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 1:49 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to propose a session about the allocation and reclamation of
> > > > > mTHP. This is related to Yu Zhao's
> > > > > TAO[1] but not the same.
> > > > >
> > > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > > genuine Android devices, utilizing
> > > > > ARM64 CONT-PTE. However, we've encountered challenges:
> > > > >
> > > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > > prolonged use, obtaining large folios
> > > > >   remains uncertain.
> > > > >   As an instance, following a few hours of operation, the likelihood
> > > > > of successfully allocating large
> > > > >   folios on a phone may decrease to just 2%.
> > > > >
> > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > > mutual blocking and unpredictable
> > > > >   latency during reclamation/allocation.
> > > >
> > > > I'm also curious how much large folios can improve reclamation
> > > > efficiency. Having large folios is supposed to reduce the scan time
> > > > since there should be fewer folios on LRU. But IIRC I haven't seen too
> > > > much data or benchmark (particularly real life workloads) regarding
> > > > this.
> > >
> > > Hi Yang,
> > >
> > > We lack direct data on this matter, but information from Ryan's THP_SWPOUT
> > > series [1] provides insights as follows:
> > >
> > > | alloc size |                baseline |           + this series |
> > > |            | mm-unstable (~v6.9-rc1) |                         |
> > > |:-----------|------------------------:|------------------------:|
> > > | 4K Page    |                    0.0% |                    1.3% |
> > > | 64K THP    |                  -13.6% |                   46.3% |
> > > | 2M THP     |                   91.4% |                   89.6% |
> > >
> > >
> > > I suspect the -13.6% performance decrease is due to the split
> > > operation. Once the split
> > > is eliminated, the patchset observed a 46.3% increase. It is presumed
> > > that the overhead
> > > required to reclaim 64K is reduced compared to reclaiming 16 * 4K.
> >
> > Thank you. Actually I care about 4k vs 64k vs 256k ...
> >
> > I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
> > swapout optimization then measured the time spent in madvise, I can
> > see the time was reduced by ~23% between 64k vs 4k. Then there is no
> > noticeable reduction between 64k and larger sizes.
>
> If you engage in perf analysis, what observations can you make? I suspect that
> even with larger folios, the function try_to_unmap_one() continues to iterate
> through PTEs individually.

Yes, I think so.

> If we're able to batch the unmapping process for the entire folio, we might
> observe improved performance.

I did profiling to my benchmark, I didn't see try_to_unmap showed as
hot spot. The time is actually spent in zram I/O.

But batching try_to_unmap() may show some improvement. Did you do it
in your kernel? It should be worth exploring.

>
> >
> > Actually I saw such a pattern (performance doesn't scale with page
> > size after 64K) with some real life workload benchmark. I'm going to
> > talk about it in today's LSF/MM.
> >
> > >
> > > However, at present, in actual android devices, we are observing
> > > nearly 100% occurrence
> > > of anon_thp_swpout_fallback after the device has been in operation for
> > > several hours[2].
> > >
> > > Hence, it is likely that we will experience regression instead of
> > > improvement due to the
> > > absence of measures to mitigate swap fragmentation.
> > >
> > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >
> > > >
> > > > >
> > > > >   For instance, if you require large folios, the LRU list's tail could
> > > > > be filled with small folios.
> > > > >   LRU(LF- large folio, SF- small folio):
> > > > >
> > > > >    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
> > > > >
> > > > >  You might end up reclaiming many small folios yet still struggle to
> > > > > allocate large folios. Conversely,
> > > > >  the inverse scenario can occur when the LRU list's tail is populated
> > > > > with large folios.
> > > > >
> > > > >    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
> > > > >
> > > > > In OPPO's products, we allocate dedicated pageblocks solely for large
> > > > > folios allocation, and we've
> > > > > fine-tuned the LRU mechanism to support dual LRU—one for small folios
> > > > > and another for large ones.
> > > > > Dedicated page blocks offer a fundamental guarantee of allocating
> > > > > large folios. Additionally, segregating
> > > > > small and large folios into two LRUs ensures that both can be
> > > > > efficiently reclaimed for their respective
> > > > > users' requests.  However, while the implementation may lack aesthetic
> > > > > appeal and is primarily tailored
> > > > > for product purposes, it isn't fully upstreamable.
> > > > >
> > > > > You can obtain the architectural diagram of OPPO's approach from link[2].
> > > > >
> > > > > Therefore, my plan is to present:
> > > > >
> > > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > > encompasses additional optimizations
> > > > >   we've made to address swap fragmentation issues and improve swap
> > > > > performance, such as dual-zRAM
> > > > >   and compression/decompression of large folios [3].
> > > > >
> > > > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > > > dual-LRU system for mTHP.
> > > > >
> > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
> > > > >
> > > > > - Discuss our future direction—are we leaning towards TAO or dedicated
> > > > > page blocks? If we opt for page
> > > > >   blocks, how do we plan to resolve the LRU issue?
> > > > >
> > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > > >
>
> Thanks,
> Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15 21:41         ` Yang Shi
@ 2024-05-15 22:15           ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2024-05-15 22:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: lsf-pc, Linux-MM

On Thu, May 16, 2024 at 9:41 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, May 16, 2024 at 1:49 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'd like to propose a session about the allocation and reclamation of
> > > > > > mTHP. This is related to Yu Zhao's
> > > > > > TAO[1] but not the same.
> > > > > >
> > > > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > > > genuine Android devices, utilizing
> > > > > > ARM64 CONT-PTE. However, we've encountered challenges:
> > > > > >
> > > > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > > > prolonged use, obtaining large folios
> > > > > >   remains uncertain.
> > > > > >   As an instance, following a few hours of operation, the likelihood
> > > > > > of successfully allocating large
> > > > > >   folios on a phone may decrease to just 2%.
> > > > > >
> > > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > > > mutual blocking and unpredictable
> > > > > >   latency during reclamation/allocation.
> > > > >
> > > > > I'm also curious how much large folios can improve reclamation
> > > > > efficiency. Having large folios is supposed to reduce the scan time
> > > > > since there should be fewer folios on LRU. But IIRC I haven't seen too
> > > > > much data or benchmark (particularly real life workloads) regarding
> > > > > this.
> > > >
> > > > Hi Yang,
> > > >
> > > > We lack direct data on this matter, but information from Ryan's THP_SWPOUT
> > > > series [1] provides insights as follows:
> > > >
> > > > | alloc size |                baseline |           + this series |
> > > > |            | mm-unstable (~v6.9-rc1) |                         |
> > > > |:-----------|------------------------:|------------------------:|
> > > > | 4K Page    |                    0.0% |                    1.3% |
> > > > | 64K THP    |                  -13.6% |                   46.3% |
> > > > | 2M THP     |                   91.4% |                   89.6% |
> > > >
> > > >
> > > > I suspect the -13.6% performance decrease is due to the split
> > > > operation. Once the split
> > > > is eliminated, the patchset observed a 46.3% increase. It is presumed
> > > > that the overhead
> > > > required to reclaim 64K is reduced compared to reclaiming 16 * 4K.
> > >
> > > Thank you. Actually I care about 4k vs 64k vs 256k ...
> > >
> > > I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
> > > swapout optimization then measured the time spent in madvise, I can
> > > see the time was reduced by ~23% between 64k vs 4k. Then there is no
> > > noticeable reduction between 64k and larger sizes.
> >
> > If you engage in perf analysis, what observations can you make? I suspect that
> > even with larger folios, the function try_to_unmap_one() continues to iterate
> > through PTEs individually.
>
> Yes, I think so.
>
> > If we're able to batch the unmapping process for the entire folio, we might
> > observe improved performance.
>
> I did profiling to my benchmark, I didn't see try_to_unmap showed as
> hot spot. The time is actually spent in zram I/O.
>
> But batching try_to_unmap() may show some improvement. Did you do it
> in your kernel? It should be worth exploring.

Not at the moment. However, we've experimented with compressing large
folios in larger granularities, like 64KiB [1]. This experimentation has yielded
significant enhancements in CPU utilization reduction and compression rates.

You can adjust the granularity through the ZSMALLOC_MULTI_PAGES_ORDER
setting, with the default value being 4.

Without our patch, zRAM compresses large folios in 4KiB granularity by iterating
each subpage.

[1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

>
> >
> > >
> > > Actually I saw such a pattern (performance doesn't scale with page
> > > size after 64K) with some real life workload benchmark. I'm going to
> > > talk about it in today's LSF/MM.
> > >
> > > >
> > > > However, at present, in actual android devices, we are observing
> > > > nearly 100% occurrence
> > > > of anon_thp_swpout_fallback after the device has been in operation for
> > > > several hours[2].
> > > >
> > > > Hence, it is likely that we will experience regression instead of
> > > > improvement due to the
> > > > absence of measures to mitigate swap fragmentation.
> > > >
> > > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > > >
> > > > >
> > > > > >
> > > > > >   For instance, if you require large folios, the LRU list's tail could
> > > > > > be filled with small folios.
> > > > > >   LRU(LF- large folio, SF- small folio):
> > > > > >
> > > > > >    LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF
> > > > > >
> > > > > >  You might end up reclaiming many small folios yet still struggle to
> > > > > > allocate large folios. Conversely,
> > > > > >  the inverse scenario can occur when the LRU list's tail is populated
> > > > > > with large folios.
> > > > > >
> > > > > >    SF - SF - SF -  LF - LF - LF - LF - LF - LF -LF - LF - LF - LF - LF - LF - LF
> > > > > >
> > > > > > In OPPO's products, we allocate dedicated pageblocks solely for large
> > > > > > folios allocation, and we've
> > > > > > fine-tuned the LRU mechanism to support dual LRU—one for small folios
> > > > > > and another for large ones.
> > > > > > Dedicated page blocks offer a fundamental guarantee of allocating
> > > > > > large folios. Additionally, segregating
> > > > > > small and large folios into two LRUs ensures that both can be
> > > > > > efficiently reclaimed for their respective
> > > > > > users' requests.  However, while the implementation may lack aesthetic
> > > > > > appeal and is primarily tailored
> > > > > > for product purposes, it isn't fully upstreamable.
> > > > > >
> > > > > > You can obtain the architectural diagram of OPPO's approach from link[2].
> > > > > >
> > > > > > Therefore, my plan is to present:
> > > > > >
> > > > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > > > encompasses additional optimizations
> > > > > >   we've made to address swap fragmentation issues and improve swap
> > > > > > performance, such as dual-zRAM
> > > > > >   and compression/decompression of large folios [3].
> > > > > >
> > > > > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > > > > dual-LRU system for mTHP.
> > > > > >
> > > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6 phones.
> > > > > >
> > > > > > - Discuss our future direction—are we leaning towards TAO or dedicated
> > > > > > page blocks? If we opt for page
> > > > > >   blocks, how do we plan to resolve the LRU issue?
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > > > >
> >
Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:22 [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation Barry Song
  2024-05-10  2:31 ` Matthew Wilcox
  2024-05-10 21:18 ` Yang Shi
@ 2024-05-15 23:41 ` Matthew Wilcox
  2024-05-16  0:25   ` Barry Song
  2024-05-22 21:43 ` David Hildenbrand
  3 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2024-05-15 23:41 UTC (permalink / raw)
  To: Barry Song; +Cc: lsf-pc, Linux-MM

On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
> - The allocation of mTHP isn't consistently reliable; even after
> prolonged use, obtaining large folios
>   remains uncertain.
>   As an instance, following a few hours of operation, the likelihood
> of successfully allocating large
>   folios on a phone may decrease to just 2%.

I'd be curious to know whether you're using a filesystem that supports
large folios or whether the pagecache is full of small folios?  The more
places that allocate large folios, the easier it becomes to allocate
large folios.

Also, do you have CONFIG_COMPACTION enabled?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-15 23:41 ` Matthew Wilcox
@ 2024-05-16  0:25   ` Barry Song
  2024-05-16  3:19     ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-05-16  0:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: lsf-pc, Linux-MM

On Thu, May 16, 2024 at 11:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
> > - The allocation of mTHP isn't consistently reliable; even after
> > prolonged use, obtaining large folios
> >   remains uncertain.
> >   As an instance, following a few hours of operation, the likelihood
> > of successfully allocating large
> >   folios on a phone may decrease to just 2%.
>
> I'd be curious to know whether you're using a filesystem that supports
> large folios or whether the pagecache is full of small folios?  The more
> places that allocate large folios, the easier it becomes to allocate
> large folios.

I am not using a filesystem with large folio support, as neither erofs (compressed files)
nor f2fs supports large folios. So, yes, the page cache is full of small folios.
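
On such a device you can watch the higher-order free blocks disappear
from /proc/buddyinfo. A quick illustrative reader (userspace; it
assumes the usual "Node N, zone NAME <count per order>" layout):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char zone[32], *p = line;
		int node, n, order = 0;
		long count, ge4 = 0;

		if (sscanf(p, "Node %d, zone %31s %n", &node, zone, &n) < 2)
			continue;
		p += n;
		while (sscanf(p, "%ld %n", &count, &n) == 1) {
			if (order >= 4) /* order-4 = 64KiB on 4KiB pages */
				ge4 += count;
			order++;
			p += n;
		}
		printf("node %d zone %-8s free blocks at order>=4: %ld\n",
		       node, zone, ge4);
	}
	fclose(f);
	return 0;
}

A near-zero count there after a few hours of use is consistent with the
~2% allocation success rate mentioned above.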

>
> Also, do you have CONFIG_COMPACTION enabled?
>

Yes. COMPACTION is enabled.
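
Whether it actually succeeds by then is a separate question; the
compact_* counters in /proc/vmstat are a quick check (illustrative
reader; the counters only exist with CONFIG_COMPACTION):

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long val;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* e.g. compact_stall, compact_fail, compact_success */
	while (fscanf(f, "%63s %lu", name, &val) == 2)
		if (!strncmp(name, "compact_", 8))
			printf("%s %lu\n", name, val);
	fclose(f);
	return 0;
}

A compact_success staying far below compact_stall tends to be the
signature of the kind of fragmentation described above.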

Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-16  0:25   ` Barry Song
@ 2024-05-16  3:19     ` Gao Xiang
  2024-05-16  6:57       ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2024-05-16  3:19 UTC (permalink / raw)
  To: Barry Song, Matthew Wilcox; +Cc: lsf-pc, Linux-MM



On 2024/5/16 08:25, Barry Song wrote:
> On Thu, May 16, 2024 at 11:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
>>> - The allocation of mTHP isn't consistently reliable; even after
>>> prolonged use, obtaining large folios
>>>    remains uncertain.
>>>    As an instance, following a few hours of operation, the likelihood
>>> of successfully allocating large
>>>    folios on a phone may decrease to just 2%.
>>
>> I'd be curious to know whether you're using a filesystem that supports
>> large folios or whether the pagecache is full of small folios?  The more
>> places that allocate large folios, the easier it becomes to allocate
>> large folios.
> 
> I am not using a filesystem with large folio support, as neither erofs (compressed files)

Side note: I will officially support large folios for compressed files
upstream within the next one or two cycles; it's almost ready in the current
codebase.

I have to do more tests to ensure it doesn't break anything...

Thanks,
Gao Xiang

> nor f2fs supports large folios. So, yes, the page cache is full of small folios.
> 
>>
>> Also, do you have CONFIG_COMPACTION enabled?
>>
> 
> Yes. COMPACTION is enabled.
> 
> Thanks
> Barry
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-16  3:19     ` Gao Xiang
@ 2024-05-16  6:57       ` Barry Song
  2024-05-16  7:07         ` Gao Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-05-16  6:57 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Matthew Wilcox, lsf-pc, Linux-MM

On Thu, May 16, 2024 at 3:19 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2024/5/16 08:25, Barry Song wrote:
> > On Thu, May 16, 2024 at 11:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
> >>> - The allocation of mTHP isn't consistently reliable; even after
> >>> prolonged use, obtaining large folios
> >>>    remains uncertain.
> >>>    As an instance, following a few hours of operation, the likelihood
> >>> of successfully allocating large
> >>>    folios on a phone may decrease to just 2%.
> >>
> >> I'd be curious to know whether you're using a filesystem that supports
> >> large folios or whether the pagecache is full of small folios?  The more
> >> places that allocate large folios, the easier it becomes to allocate
> >> large folios.
> >
> > I am not using a filesystem with large folio support, as neither erofs (compressed files)
>
> Side note: I will officially support large folios for compressed files
> upstream within the next one or two cycles; it's almost ready in the current
> codebase.

Thanks for passing along this fantastic news! Feel free to reach me when you
send the patchset. I'm eager to delve into the code and run some tests.

>
> I have to do more tests to ensure it doesn't break anything...
>
> Thanks,
> Gao Xiang
>
> > nor f2fs supports large folios. So, yes, the page cache is full of small folios.
> >
> >>
> >> Also, do you have CONFIG_COMPACTION enabled?
> >>
> >
> > Yes. COMPACTION is enabled.
> >
Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-16  6:57       ` Barry Song
@ 2024-05-16  7:07         ` Gao Xiang
  0 siblings, 0 replies; 22+ messages in thread
From: Gao Xiang @ 2024-05-16  7:07 UTC (permalink / raw)
  To: Barry Song; +Cc: Matthew Wilcox, lsf-pc, Linux-MM



On 2024/5/16 14:57, Barry Song wrote:
> On Thu, May 16, 2024 at 3:19 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/5/16 08:25, Barry Song wrote:
>>> On Thu, May 16, 2024 at 11:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Fri, May 10, 2024 at 02:22:02PM +1200, Barry Song wrote:
>>>>> - The allocation of mTHP isn't consistently reliable; even after
>>>>> prolonged use, obtaining large folios
>>>>>     remains uncertain.
>>>>>     As an instance, following a few hours of operation, the likelihood
>>>>> of successfully allocating large
>>>>>     folios on a phone may decrease to just 2%.
>>>>
>>>> I'd be curious to know whether you're using a filesystem that supports
>>>> large folios or whether the pagecache is full of small folios?  The more
>>>> places that allocate large folios, the easier it becomes to allocate
>>>> large folios.
>>>
>>> I am not using a filesystem with large folio support, as neither erofs (compressed files)
>>
>> Side note: I will officially support large folios for compressed files
>> upstream within the next one or two cycles; it's almost ready in the current
>> codebase.
> 
> Thanks for passing along this fantastic news! Feel free to reach me when you
> send the patchset. I'm eager to delve into the code and run some tests.

Okay, I will Cc you once I sort them out.

Thanks,
Gao Xiang

> 
>>
>> I have to do more tests to ensure it doesn't break anything...
>>
>> Thanks,
>> Gao Xiang
>>
>>> nor f2fs supports large folios. So, yes, the page cache is full of small folios.
>>>
>>>>
>>>> Also, do you have CONFIG_COMPACTION enabled?
>>>>
>>>
>>> Yes. COMPACTION is enabled.
>>>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
  2024-05-10  2:22 [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation Barry Song
                   ` (2 preceding siblings ...)
  2024-05-15 23:41 ` Matthew Wilcox
@ 2024-05-22 21:43 ` David Hildenbrand
  3 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-05-22 21:43 UTC (permalink / raw)
  To: Barry Song, lsf-pc; +Cc: Linux-MM

On 10.05.24 04:22, Barry Song wrote:
> Hi,
> 
> I'd like to propose a session about the allocation and reclamation of
> mTHP. This is related to Yu Zhao's
> TAO[1] but not the same.
> 
> OPPO has implemented mTHP-like large folios across thousands of
> genuine Android devices, utilizing
> ARM64 CONT-PTE. However, we've encountered challenges:
> 
> - The allocation of mTHP isn't consistently reliable; even after
> prolonged use, obtaining large folios
>    remains uncertain.
>    As an instance, following a few hours of operation, the likelihood
> of successfully allocating large
>    folios on a phone may decrease to just 2%.
> 
> - Mixing large and small folios in the same LRU list can lead to
> mutual blocking and unpredictable
>    latency during reclamation/allocation.
> 
>    For instance, if you require large folios, the LRU list's tail could
> be filled with small folios.
>    LRU(LF- large folio, SF- small folio):
> 
>     LF - LF - LF -  SF - SF - SF - SF - SF - SF -SF - SF - SF - SF - SF - SF - SF

As we see more and more large folio users, with folios of differing
sizes, I do wonder if the problem will shift from order-0 vs. order>0
to a contest between several orders, e.g. order-0 vs. order-4 vs. other
orders.

Meaning: small vs. large folios will, in the long term, not be the real
issue (although in some configurations it will certainly remain one).
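
For concreteness, the anonymous fault path already has every enabled
order competing for contiguity before falling back to order-0. A
simplified sketch, modeled on alloc_anon_folio() in mm/memory.c (memcg
charging and PTE rechecks elided):

static struct folio *alloc_anon_folio_sketch(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	gfp_t gfp = vma_thp_gfp_mask(vma);
	unsigned long orders, addr;
	struct folio *folio;
	int order;

	/* Orders enabled via sysfs that fit this VMA and address. */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true,
					  true, BIT(PMD_ORDER) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

	/* Walk from the largest enabled order downwards. */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio)
			return folio; /* largest order with contiguity wins */
		order = next_order(&orders, order);
	}

	/* Every larger order failed: fall back to a single page. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address,
			       false);
}

Once several of those orders are in heavy use, reclaim and compaction
have to arbitrate between all of them, not just between order-0 and
"large".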

What's your take on that?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-05-22 21:43 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-10  2:22 [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation Barry Song
2024-05-10  2:31 ` Matthew Wilcox
2024-05-10  2:42   ` Barry Song
2024-05-10 14:25     ` [Lsf-pc] " Michal Hocko
2024-05-10 20:33       ` Yu Zhao
2024-05-15  2:42         ` Barry Song
2024-05-15 10:21           ` Karim Manaouil
2024-05-15 10:59           ` Yu Zhao
2024-05-15 13:50           ` Yang Shi
2024-05-15 18:14             ` Barry Song
2024-05-10 21:18 ` Yang Shi
2024-05-14  9:20   ` Barry Song
2024-05-15 13:49     ` Yang Shi
2024-05-15 19:25       ` Barry Song
2024-05-15 21:41         ` Yang Shi
2024-05-15 22:15           ` Barry Song
2024-05-15 23:41 ` Matthew Wilcox
2024-05-16  0:25   ` Barry Song
2024-05-16  3:19     ` Gao Xiang
2024-05-16  6:57       ` Barry Song
2024-05-16  7:07         ` Gao Xiang
2024-05-22 21:43 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox