From: Usama Arif <usamaarif642@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, axboe@kernel.dk,
bala.seshasayee@linux.intel.com, chrisl@kernel.org,
david@redhat.com, hannes@cmpxchg.org,
kanchana.p.sridhar@intel.com, kasong@tencent.com,
linux-block@vger.kernel.org, minchan@kernel.org,
nphamcs@gmail.com, ryan.roberts@arm.com,
senozhatsky@chromium.org, surenb@google.com, terrelln@fb.com,
v-songbaohua@oppo.com, wajdi.k.feghali@intel.com,
willy@infradead.org, ying.huang@intel.com, yosryahmed@google.com,
yuzhao@google.com, zhengtangquan@oppo.com,
zhouchengming@bytedance.com, Chuanhua Han <chuanhuahan@gmail.com>
Subject: Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
Date: Mon, 25 Nov 2024 16:19:07 +0000
Message-ID: <b6db556d-70e6-4adf-9ce1-d4e5af08e89c@gmail.com>
In-Reply-To: <CAGsJ_4wL=CgXdCt+2QC+aSKPh1873QyD_4ZkRSBniUipKX9AVA@mail.gmail.com>
On 24/11/2024 21:47, Barry Song wrote:
> On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 21/11/2024 22:25, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> The swapfile can compress/decompress at a 4-page (16 KiB) granularity, reducing
>>> CPU usage and improving the compression ratio. However, if allocating an
>>> mTHP fails and we fall back to a single small folio, the entire large
>>> block must still be decompressed. This results in a 16 KiB area requiring
>>> 4 page faults, where each fault decompresses 16 KiB but retrieves only
>>> 4 KiB of data from the block. To address this inefficiency, we instead
>>> fall back to 4 small folios, ensuring that each decompression occurs
>>> only once.
>>>
>>> Allowing swap_read_folio() to decompress and read into an array of
>>> 4 folios would be extremely complex, requiring extensive changes
>>> throughout the stack, including swap_read_folio, zeromap,
>>> zswap, and final swap implementations like zRAM. In contrast,
>>> having these components fill a large folio with 4 subpages is much
>>> simpler.
>>>
>>> To avoid a full-stack modification, we introduce a per-CPU order-2
>>> large folio as a buffer. This buffer is used for swap_read_folio(),
>>> after which the data is copied into the 4 small folios. Finally, in
>>> do_swap_page(), all these small folios are mapped.
>>>
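If I'm reading this right, the fallback path is roughly the following
(a simplified sketch with illustrative names based on the description
above, not the actual patch code):

    /*
     * Hypothetical sketch: decompress the whole 16KiB block once into
     * a per-CPU order-2 buffer folio, then fan the data out into four
     * order-0 folios, which do_swap_page() maps afterwards.
     */
    #define BATCH_SWPIN_ORDER 2
    #define BATCH_SWPIN_COUNT (1 << BATCH_SWPIN_ORDER)

    static void batch_swpin_fill(struct folio *buffer,
                                 struct folio *small[BATCH_SWPIN_COUNT])
    {
            int i;

            /* one swap read -> one 16KiB decompression, not four */
            swap_read_folio(buffer, NULL);

            for (i = 0; i < BATCH_SWPIN_COUNT; i++)
                    copy_highpage(folio_page(small[i], 0),
                                  folio_page(buffer, i));
    }

If so, the cost is one extra 16KiB copy per fallback in exchange for
three avoided decompressions, which seems like a good trade.
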
>>> Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
>>> Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>> mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 192 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 209885a4134f..e551570c1425 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
>>> return folio;
>>> }
>>>
>>> +#define BATCH_SWPIN_ORDER 2
>>
>> Hi Barry,
>>
>> Thanks for the series and the numbers in the cover letter.
>>
>> Just a few things.
>>
>> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
>
> Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
> and always setting it to 2, which is the minimum anonymous mTHP order. The
> main reason is that it may be difficult for users to select an appropriate
> Kconfig value.
>
> On the other hand, 16KB captures most of the benefit of larger-block zstd
> compression and decompression. Increasing from 16KB to 32KB or 64KB offers
> some additional gain, but nothing as significant as the jump from 4KB
> to 16KB.
>
> For example, using zstd to compress and decompress the 'Beyond Compare'
> software package:
>
> root@barry-desktop:~# ./zstd
> File size: 182502912 bytes
>
> Block   Compression   Decompression   Compressed size   Ratio
> 4KB     0.765915 s    0.203366 s      66089193 bytes    36.21%
> 16KB    0.558595 s    0.153837 s      59159073 bytes    32.42%
> 32KB    0.538106 s    0.137768 s      57958701 bytes    31.76%
> 64KB    0.532212 s    0.127592 s      56700795 bytes    31.07%
>
> In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
>
Yes, I think that if there isn't a very significant benefit to using a larger
order, it's better not to have this option. It would also simplify the code.
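
For reference, the block-size comparison above should be straightforward
to reproduce with something like the sketch below (my assumption of the
methodology using the single-shot libzstd API, compression side only;
not Barry's actual ./zstd test program):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <zstd.h>

    /* Compress buf in blk-sized blocks; report size, ratio and time. */
    static void bench(const char *buf, size_t len, size_t blk)
    {
            size_t bound = ZSTD_compressBound(blk);
            char *dst = malloc(bound);
            size_t total = 0;
            struct timespec t0, t1;

            if (!dst)
                    return;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t off = 0; off < len; off += blk) {
                    size_t n = len - off < blk ? len - off : blk;
                    size_t c = ZSTD_compress(dst, bound, buf + off, n, 3);
                    if (!ZSTD_isError(c))
                            total += c;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("%zuKB block: %zu -> %zu bytes (%.2f%%), %.6f s\n",
                   blk / 1024, len, total, 100.0 * total / len,
                   (t1.tv_sec - t0.tv_sec) +
                   (t1.tv_nsec - t0.tv_nsec) / 1e9);
            free(dst);
    }

Building with "cc bench.c -lzstd" and calling bench() over the same file
with blk = 4096, 16384, 32768 and 65536 should reproduce the shape of
the numbers above; the decompression side is the analogous
ZSTD_decompress() loop.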
>>
>> Did you check the performance difference with and without patch 4?
>
> I retested after reverting patch 4, and the sys time increased to over
> 40 minutes again, though it was slightly better than without the entire
> series.
>
> *** Rounds 1-5 (patch 4 reverted) ***
>
> Round          1           2           3           4           5
> real           7m49.342s   7m41.331s   7m47.280s   7m56.723s   7m53.806s
> user           80m53.675s  81m16.631s  78m36.767s  80m36.837s  80m30.953s
> sys            42m28.393s  41m39.845s  37m32.210s  41m35.979s  40m14.870s
> pswpin         29965548    29208867    26426526    29367639    28091760
> pswpout        51127359    50006026    45420734    50059254    48495748
> 16kB-swpout    11347712    11104912    10104304    11116176    10779720
> 16kB-swpin     6641230     6483827     5884839     6514064     6244819
> pgpgin         147376000   144057340   132013648   144593828   138813124
> pgpgout        213343124   208887688   190537264   209080468   202885480
>
> (The 64kB- and 32kB-swpin/swpout counters were 0 in every round.)
>
> I guess it is due to the large number of partial reads (about 10%:
> 3505537 of 35159852 multi-page reads).
>
> root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
>
> zram_bio write/read multi_pages count:54452828 35159852
> zram_bio failed write/read multi_pages count 0 0
> zram_bio partial write/read multi_pages count 4 3505537
> multi_pages_miss_free 0
>
> This workload doesn't cause fragmentation in the buddy allocator, so it's
> likely due to memcg charge failures.
>
>>
>> I know that it won't help if you have a lot of unmovable pages
>> scattered everywhere, but were you able to compare the performance
>> of defrag=always vs patch 4? I feel like if you have space for 4 folios
>> then hopefully compaction should be able to do its job and you can
>> directly fill the large folio if the unmovable pages are better placed.
>> Johannes' series on preventing type mixing [1] would help.
>>
>> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
>
> I believe this could help, but defragmentation is a complex issue,
> especially on phones, where various components like drivers, DMA-BUF,
> multimedia, and graphics utilize memory.
>
> We observed that a fresh system could initially provide mTHP, but after a few
> hours, obtaining mTHP became very challenging. I'm happy to arrange a test
> of Johannes' series on phones (sometimes it is quite hard to backport to the
> Android kernel) to see if it brings any improvements.
>
I think it's definitely worth trying. If we can improve memory
allocation/compaction instead of patch 4, then we should go for that. Maybe
there won't be a need for TAO if allocation is done in a smarter way?

Just out of curiosity, what is the base kernel version you are testing with?
Thanks,
Usama