* [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
@ 2024-12-12 7:37 Barry Song
2024-12-12 8:29 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2024-12-12 7:37 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, nphamcs, ryan.roberts, senozhatsky,
terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy,
ying.huang, yosryahmed, baolin.wang
From: Barry Song <v-songbaohua@oppo.com>
While developing the zeromap series, Usama observed that certain
workloads may contain over 10% zero-filled pages. This may present
an opportunity to save memory by mapping zero-filled pages to zero_pfn
in do_swap_page(). If a write occurs later, do_wp_page() can
allocate a new page using the Copy-on-Write mechanism.
For workloads with numerous zero-filled pages, this can greatly
reduce the RSS.
For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#define SIZE (20 * 1024 * 1024)
int main(void)
{
	volatile char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
					     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	volatile char data;

	if (buffer == MAP_FAILED) {
		perror("mmap failed");
		exit(EXIT_FAILURE);
	}

	/* casts drop the volatile qualifier for memset()/madvise() */
	memset((void *)buffer, 0, SIZE);
	if (madvise((void *)buffer, SIZE, MADV_PAGEOUT) != 0)
		perror("madvise MADV_PAGEOUT failed");

	for (size_t i = 0; i < SIZE; i++)
		data = buffer[i];

	sleep(1000);
	return 0;
}
~ # ./a.out &
w/o patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 101 2.9 10.6 22540 21268 ttyAMA0 S 06:50 0:00 ./a.out
w/ patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 141 0.1 0.3 22540 792 ttyAMA0 S 06:38 0:00 ./a.out
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/memory.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 2bacebbf4cf6..b37f0f61d0bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4294,6 +4294,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
bool need_clear_cache = false;
+ bool map_zero_pfn = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
@@ -4364,6 +4365,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapcache = folio;
if (!folio) {
+ /* Use the zero-page for reads */
+ if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+ !mm_forbids_zeropage(vma->vm_mm) &&
+ __swap_count(entry) == 1) {
+ swap_zeromap_batch(entry, 1, &map_zero_pfn);
+ if (map_zero_pfn) {
+ if (swapcache_prepare(entry, 1)) {
+ add_wait_queue(&swapcache_wq, &wait);
+ schedule_timeout_uninterruptible(1);
+ remove_wait_queue(&swapcache_wq, &wait);
+ goto out;
+ }
+ nr_pages = 1;
+ need_clear_cache = true;
+ pte = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
+ vma->vm_page_prot));
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+ &vmf->ptl);
+ if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte),
+ vmf->orig_pte)))
+ goto unlock;
+
+ page = pfn_to_page(my_zero_pfn(vmf->address));
+ arch_swap_restore(entry, page_folio(page));
+ swap_free_nr(entry, 1);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+ set_ptes(vma->vm_mm, vmf->address, vmf->pte, pte, 1);
+ arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address, pte, pte, 1);
+ update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ goto unlock;
+ }
+ }
+
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
--
2.39.3 (Apple Git-146)
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  7:37 [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in Barry Song
@ 2024-12-12  8:29 ` Christoph Hellwig
  2024-12-12  8:46 ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2024-12-12 8:29 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
    kanchana.p.sridhar, kasong, nphamcs, ryan.roberts, senozhatsky,
    terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy,
    ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> While developing the zeromap series, Usama observed that certain
> workloads may contain over 10% zero-filled pages. This may present
> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> in do_swap_page(). If a write occurs later, do_wp_page() can
> allocate a new page using the Copy-on-Write mechanism.

Shouldn't this be done during, or rather instead of, swap-out?
Swapping all zero pages out just to optimize the in-memory
representation seems rather backwards.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  8:29 ` Christoph Hellwig
@ 2024-12-12  8:46 ` Barry Song
  2024-12-12  8:50   ` Christoph Hellwig
  2024-12-12  8:50   ` David Hildenbrand
  0 siblings, 2 replies; 10+ messages in thread
From: Barry Song @ 2024-12-12 8:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
    kanchana.p.sridhar, kasong, nphamcs, ryan.roberts, senozhatsky,
    terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy,
    ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > While developing the zeromap series, Usama observed that certain
> > workloads may contain over 10% zero-filled pages. This may present
> > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > in do_swap_page(). If a write occurs later, do_wp_page() can
> > allocate a new page using the Copy-on-Write mechanism.
>
> Shouldn't this be done during, or rather instead of swap out instead?
> Swapping all zero pages out just to optimize the in-memory
> representation on seems rather backwards.

I’m having trouble understanding your point—it seems like you might
not have fully read the code. :-)

The situation is as follows: for a zero-filled page, we are currently
allocating a new page unconditionally. By mapping this zero-filled page
to zero_pfn, we could save the memory used by this page.

We don't need to allocate the memory until the page is written (which
may never happen).

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  8:46 ` Barry Song
@ 2024-12-12  8:50   ` Christoph Hellwig
  2024-12-12  8:54     ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2024-12-12 8:50 UTC (permalink / raw)
To: Barry Song
Cc: Christoph Hellwig, akpm, linux-mm, axboe, bala.seshasayee, chrisl,
    david, hannes, kanchana.p.sridhar, kasong, nphamcs, ryan.roberts,
    senozhatsky, terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali,
    willy, ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > While developing the zeromap series, Usama observed that certain
> > > workloads may contain over 10% zero-filled pages. This may present
> > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > allocate a new page using the Copy-on-Write mechanism.
> >
> > Shouldn't this be done during, or rather instead of swap out instead?
> > Swapping all zero pages out just to optimize the in-memory
> > representation on seems rather backwards.
>
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)

I've not read the code at all, I've read your commit log.

> The situation is as follows: for a zero-filled page, we are currently
> allocating a new page unconditionally. By mapping this zero-filled page
> to zero_pfn, we could save the memory used by this page.

Yes.  But why do that in swap-in and not swap-out?

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  8:50 ` Christoph Hellwig
@ 2024-12-12  8:54   ` Barry Song
  0 siblings, 0 replies; 10+ messages in thread
From: Barry Song @ 2024-12-12 8:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
    kanchana.p.sridhar, kasong, nphamcs, ryan.roberts, senozhatsky,
    terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy,
    ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 9:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > While developing the zeromap series, Usama observed that certain
> > > > workloads may contain over 10% zero-filled pages. This may present
> > > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > allocate a new page using the Copy-on-Write mechanism.
> > >
> > > Shouldn't this be done during, or rather instead of swap out instead?
> > > Swapping all zero pages out just to optimize the in-memory
> > > representation on seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
>
> I've not read the code at all, I've read your commit log.
>
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new page unconditionally. By mapping this zero-filled page
> > to zero_pfn, we could save the memory used by this page.
>
> Yes.  But why do that in swap-in and not swap-out?

Usama implemented this in swap-out, where no I/O occurs after his
zeromap series. A bit is set in the swap->zeromap bitmap if the
swapped-out page is zero-filled, and all swap-out I/O is skipped.

Now, the situation is that when we re-access a swapped-out page, we
don't always need to allocate a new page. Instead, we can map it to
zero_pfn and defer the allocation until the page is written.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  8:46 ` Barry Song
  2024-12-12  8:50   ` Christoph Hellwig
@ 2024-12-12  8:50   ` David Hildenbrand
  2024-12-12  9:16     ` Barry Song
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-12-12 8:50 UTC (permalink / raw)
To: Barry Song, Christoph Hellwig
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, hannes,
    kanchana.p.sridhar, kasong, nphamcs, ryan.roberts, senozhatsky,
    terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy,
    ying.huang, yosryahmed, baolin.wang

On 12.12.24 09:46, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> While developing the zeromap series, Usama observed that certain
>>> workloads may contain over 10% zero-filled pages. This may present
>>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
>>> in do_swap_page(). If a write occurs later, do_wp_page() can
>>> allocate a new page using the Copy-on-Write mechanism.
>>
>> Shouldn't this be done during, or rather instead of swap out instead?
>> Swapping all zero pages out just to optimize the in-memory
>> representation on seems rather backwards.
>
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)
>
> The situation is as follows: for a zero-filled page, we are currently
> allocating a new page unconditionally. By mapping this zero-filled page
> to zero_pfn, we could save the memory used by this page.
>
> We don't need to allocate the memory until the page is written (which
> may never happen).

I think what Christoph means is that you would determine that at PTE
unmap time, and directly place the zero page in there. So there would be
no need to have the page fault at all.

I suspect at PTE unmap time might be problematic, because we might still
have other (i.e., GUP) references modifying that page, and we can only
rely on the page content being stable after we flushed the TLB as well.
(I recall some deferred flushing optimizations)

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  8:50 ` David Hildenbrand
@ 2024-12-12  9:16   ` Barry Song
  2024-12-12 16:25     ` Johannes Weiner
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2024-12-12 9:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: Christoph Hellwig, akpm, linux-mm, axboe, bala.seshasayee, chrisl,
    hannes, kanchana.p.sridhar, kasong, nphamcs, ryan.roberts,
    senozhatsky, terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali,
    willy, ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.12.24 09:46, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> While developing the zeromap series, Usama observed that certain
> >>> workloads may contain over 10% zero-filled pages. This may present
> >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> >>> allocate a new page using the Copy-on-Write mechanism.
> >>
> >> Shouldn't this be done during, or rather instead of swap out instead?
> >> Swapping all zero pages out just to optimize the in-memory
> >> representation on seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
> >
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new page unconditionally. By mapping this zero-filled page
> > to zero_pfn, we could save the memory used by this page.
> >
> > We don't need to allocate the memory until the page is written (which
> > may never happen).
>
> I think what Christoph means is that you would determine that at PTE
> unmap time, and directly place the zero page in there. So there would be
> no need to have the page fault at all.
>
> I suspect at PTE unmap time might be problematic, because we might still
> have other (i.e., GUP) references modifying that page, and we can only
> rely on the page content being stable after we flushed the TLB as well.
> (I recall some deferred flushing optimizations)

Yes, we need to follow a strict sequence:

1. try_to_unmap - unmap PTEs in all processes;
2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
3. pageout - zeromap will set 1 in bitmap if page is zero-filled

At the moment of pageout(), we can be confident that the page is
zero-filled.

mapping to zeropage during unmap seems quite risky.

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12  9:16 ` Barry Song
@ 2024-12-12 16:25   ` Johannes Weiner
  2024-12-13  1:47     ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2024-12-12 16:25 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Christoph Hellwig, akpm, linux-mm, axboe,
    bala.seshasayee, chrisl, kanchana.p.sridhar, kasong, nphamcs,
    ryan.roberts, senozhatsky, terrelln, usamaarif642, v-songbaohua,
    wajdi.k.feghali, willy, ying.huang, yosryahmed, baolin.wang

On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 12.12.24 09:46, Barry Song wrote:
> > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >>
> > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > >>> From: Barry Song <v-songbaohua@oppo.com>
> > >>>
> > >>> While developing the zeromap series, Usama observed that certain
> > >>> workloads may contain over 10% zero-filled pages. This may present
> > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > >>> allocate a new page using the Copy-on-Write mechanism.
> > >>
> > >> Shouldn't this be done during, or rather instead of swap out instead?
> > >> Swapping all zero pages out just to optimize the in-memory
> > >> representation on seems rather backwards.
> > >
> > > I’m having trouble understanding your point—it seems like you might
> > > not have fully read the code. :-)
> > >
> > > The situation is as follows: for a zero-filled page, we are currently
> > > allocating a new page unconditionally. By mapping this zero-filled page
> > > to zero_pfn, we could save the memory used by this page.
> > >
> > > We don't need to allocate the memory until the page is written (which
> > > may never happen).
> >
> > I think what Christoph means is that you would determine that at PTE
> > unmap time, and directly place the zero page in there. So there would be
> > no need to have the page fault at all.
> >
> > I suspect at PTE unmap time might be problematic, because we might still
> > have other (i.e., GUP) references modifying that page, and we can only
> > rely on the page content being stable after we flushed the TLB as well.
> > (I recall some deferred flushing optimizations)
>
> Yes, we need to follow a strict sequence:
>
> 1. try_to_unmap - unmap PTEs in all processes;
> 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> 3. pageout - zeromap will set 1 in bitmap if page is zero-filled
>
> At the moment of pageout(), we can be confident that the page is
> zero-filled.
>
> mapping to zeropage during unmap seems quite risky.

You have to unmap and flush to stop modifications, but I think not in
all processes before it's safe to decide. Shared anon pages have COW
semantics; when you enter try_to_unmap() with a page and rmap gives
you a pte, it's one of these:

a) never forked, no sibling ptes
b) cow broken into private copy, no sibling ptes
c) cow/WP; any writes to this or another pte will go to a new page.

In cases a and b you need to unmap and flush the current pte, but then
it's safe to check contents and set the zero pte right away, even
before finishing the rmap walk.

In case c, modifications to the page are impossible due to WP, so you
don't even need to unmap and flush before checking the contents. The
pte lock holds up COW breaking to a new page until you're done.

It's definitely more complicated than the current implementation, but
if it can be made to work, we could get rid of the bitmap.

You might also reduce faults, but I'm a bit skeptical. Presumably
zerofilled regions are mostly considered invalid by the application,
not useful data, so a populating write that will cowbreak seems more
likely to happen next than a faultless read from the zeropage.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
  2024-12-12 16:25 ` Johannes Weiner
@ 2024-12-13  1:47   ` Barry Song
  2024-12-13  2:27     ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2024-12-13 1:47 UTC (permalink / raw)
To: hannes
Cc: 21cnbao, akpm, axboe, bala.seshasayee, baolin.wang, chrisl, david,
    hch, kanchana.p.sridhar, kasong, linux-mm, nphamcs, ryan.roberts,
    senozhatsky, terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali,
    willy, ying.huang, yosryahmed

On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> > Yes, we need to follow a strict sequence:
> >
> > 1. try_to_unmap - unmap PTEs in all processes;
> > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> > 3. pageout - zeromap will set 1 in bitmap if page is zero-filled
> >
> > At the moment of pageout(), we can be confident that the page is
> > zero-filled.
> >
> > mapping to zeropage during unmap seems quite risky.
>
> You have to unmap and flush to stop modifications, but I think not in
> all processes before it's safe to decide. Shared anon pages have COW
> semantics; when you enter try_to_unmap() with a page and rmap gives
> you a pte, it's one of these:
>
> a) never forked, no sibling ptes
> b) cow broken into private copy, no sibling ptes
> c) cow/WP; any writes to this or another pte will go to a new page.
>
> In cases a and b you need to unmap and flush the current pte, but then
> it's safe to check contents and set the zero pte right away, even
> before finishing the rmap walk.
>
> In case c, modifications to the page are impossible due to WP, so you
> don't even need to unmap and flush before checking the contents. The
> pte lock holds up COW breaking to a new page until you're done.
>
> It's definitely more complicated than the current implementation, but
> if it can be made to work, we could get rid of the bitmap.
>
> You might also reduce faults, but I'm a bit skeptical. Presumably
> zerofilled regions are mostly considered invalid by the application,
> not useful data, so a populating write that will cowbreak seems more
> likely to happen next than a faultless read from the zeropage.

Yes. That is right.

I created the following debug patch to count the proportional distribution
of zero_swpin reads, as well as the comparison between zero_swpin and
zero_swpout:

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..ed9d1a6cc565 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		SWAP_RA_HIT,
 		SWPIN_ZERO,
 		SWPOUT_ZERO,
+		SWPIN_ZERO_READ,
 #ifdef CONFIG_KSM
 		KSM_SWPIN_COPY,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index f3040c69f648..3aacfbe7bd77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			/* Count SWPIN_ZERO since page_io was skipped */
 			objcg = get_obj_cgroup_from_swap(entry);
 			count_vm_events(SWPIN_ZERO, 1);
+			count_vm_events(SWPIN_ZERO_READ, 1);
 			if (objcg) {
 				count_objcg_events(objcg, SWPIN_ZERO, 1);
 				obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..9465fe9bda9e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
 	"swap_ra_hit",
 	"swpin_zero",
 	"swpout_zero",
+	"swpin_zero_read",
 #ifdef CONFIG_KSM
 	"ksm_swpin_copy",
 #endif

For a kernel-build workload in a single memcg with only 1GB of memory, use
the script below:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
    swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
    swpin_zero_read=$(grep "swpin_zero_read" $vmstat_path | awk '{print $2}')

    echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
}

for ((i=1; i<=5; i++))
do
    echo
    echo "*** Executing round $i ***"
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
    sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
    #kernel build
    initial_values=($(read_values))
    time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
    final_values=($(read_values))

    echo "pswpin: $((final_values[0] - initial_values[0]))"
    echo "pswpout: $((final_values[1] - initial_values[1]))"
    echo "pgpgin: $((final_values[2] - initial_values[2]))"
    echo "pgpgout: $((final_values[3] - initial_values[3]))"
    echo "swpout_zero: $((final_values[4] - initial_values[4]))"
    echo "swpin_zero: $((final_values[5] - initial_values[5]))"
    echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
done

The results I am seeing are as follows:

real	6m43.998s
user	47m3.800s
sys	5m7.169s
pswpin: 342041
pswpout: 1470846
pgpgin: 11744932
pgpgout: 14466564
swpout_zero: 318030
swpin_zero: 93621
swpin_zero_read: 13118

The proportion of zero_swpout is quite large (> 10%): 318,030 vs.
1,470,846. The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).
About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of
those zero_swpin pages are read (13,118 / 93,621). Therefore, a total of
17.8% * 29.4% * 14% = 0.73% of all swapped-out pages will be re-mapped
to zero_pfn, potentially saving up to 0.73% RSS in this kernel-build
workload.

Thus, the total build time in my final results falls within the testing
jitter range, showing no noticeable difference, while the conceptual
model code (with lots of zero-filled pages and read swap-ins) shows
significant differences. I'm not sure if we can identify another real
workload with more read swap-ins to observe noticeable improvements.
Perhaps Usama has some?

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in 2024-12-13 1:47 ` Barry Song @ 2024-12-13 2:27 ` Barry Song 0 siblings, 0 replies; 10+ messages in thread From: Barry Song @ 2024-12-13 2:27 UTC (permalink / raw) To: hannes Cc: akpm, axboe, bala.seshasayee, baolin.wang, chrisl, david, hch, kanchana.p.sridhar, kasong, linux-mm, nphamcs, ryan.roberts, senozhatsky, terrelln, usamaarif642, v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed On Fri, Dec 13, 2024 at 2:47 PM Barry Song <21cnbao@gmail.com> wrote: > > On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote: > > > On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > > > On 12.12.24 09:46, Barry Song wrote: > > > > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > >> > > > > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote: > > > > >>> From: Barry Song <v-songbaohua@oppo.com> > > > > >>> > > > > >>> While developing the zeromap series, Usama observed that certain > > > > >>> workloads may contain over 10% zero-filled pages. This may present > > > > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn > > > > >>> in do_swap_page(). If a write occurs later, do_wp_page() can > > > > >>> allocate a new page using the Copy-on-Write mechanism. > > > > >> > > > > >> Shouldn't this be done during, or rather instead of swap out instead? > > > > >> Swapping all zero pages out just to optimize the in-memory > > > > >> representation on seems rather backwards. > > > > > > > > > > I’m having trouble understanding your point—it seems like you might > > > > > not have fully read the code. :-) > > > > > > > > > > The situation is as follows: for a zero-filled page, we are currently > > > > > allocating a new > > > > > page unconditionally. 
By mapping this zero-filled page to zero_pfn, we could > > > > > save the memory used by this page. > > > > > > > > > > We don't need to allocate the memory until the page is written(which may never > > > > > happen). > > > > > > > > I think what Christoph means is that you would determine that at PTE > > > > unmap time, and directly place the zero page in there. So there would be > > > > no need to have the page fault at all. > > > > > > > > I suspect at PTE unmap time might be problematic, because we might still > > > > have other (i.e., GUP) references modifying that page, and we can only > > > > rely on the page content being stable after we flushed the TLB as well. > > > > (I recall some deferred flushing optimizations) > > > > > > Yes, we need to follow a strict sequence: > > > > > > 1. try_to_unmap - unmap PTEs in all processes; > > > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown; > > > 3. pageout - zeromap will set 1 in bitmap if page is zero-filled > > > > > > At the moment of pageout(), we can be confident that the page is zero-filled. > > > > > > mapping to zeropage during unmap seems quite risky. > > > > You have to unmap and flush to stop modifications, but I think not in > > all processes before it's safe to decide. Shared anon pages have COW > > semantics; when you enter try_to_unmap() with a page and rmap gives > > you a pte, it's one of these: > > > > a) never forked, no sibling ptes > > b) cow broken into private copy, no sibling ptes > > c) cow/WP; any writes to this or another pte will go to a new page. > > > > In cases a and b you need to unmap and flush the current pte, but then > > it's safe to check contents and set the zero pte right away, even > > before finishing the rmap walk. > > > > In case c, modifications to the page are impossible due to WP, so you > > don't even need to unmap and flush before checking the contents. The > > pte lock holds up COW breaking to a new page until you're done. 
> >
> > It's definitely more complicated than the current implementation, but
> > if it can be made to work, we could get rid of the bitmap.
> >
> > You might also reduce faults, but I'm a bit skeptical. Presumably
> > zero-filled regions are mostly considered invalid by the application,
> > not useful data, so a populating write that will cow-break seems more
> > likely to happen next than a faultless read from the zeropage.
>
> Yes. That is right.
>
> I created the following debug patch to count the proportional distribution
> of zero_swpin reads, as well as the comparison between zero_swpin and
> zero_swpout:
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..ed9d1a6cc565 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		SWAP_RA_HIT,
>  		SWPIN_ZERO,
>  		SWPOUT_ZERO,
> +		SWPIN_ZERO_READ,
>  #ifdef CONFIG_KSM
>  		KSM_SWPIN_COPY,
>  #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index f3040c69f648..3aacfbe7bd77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			/* Count SWPIN_ZERO since page_io was skipped */
>  			objcg = get_obj_cgroup_from_swap(entry);
>  			count_vm_events(SWPIN_ZERO, 1);
> +			count_vm_events(SWPIN_ZERO_READ, 1);
>  			if (objcg) {
>  				count_objcg_events(objcg, SWPIN_ZERO, 1);
>  				obj_cgroup_put(objcg);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..9465fe9bda9e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
>  	"swap_ra_hit",
>  	"swpin_zero",
>  	"swpout_zero",
> +	"swpin_zero_read",
>  #ifdef CONFIG_KSM
>  	"ksm_swpin_copy",
>  #endif
>
> For a kernel-build workload in a single memcg with only 1GB of memory, use
> the script below:
>
> #!/bin/bash
>
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero_read=$(grep "swpin_zero_read" $vmstat_path | awk '{print $2}')
>
>     echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
> }
>
> for ((i=1; i<=5; i++))
> do
>     echo
>     echo "*** Executing round $i ***"
>     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>     sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
>     # kernel build
>     initial_values=($(read_values))
>     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>     final_values=($(read_values))
>
>     echo "pswpin: $((final_values[0] - initial_values[0]))"
>     echo "pswpout: $((final_values[1] - initial_values[1]))"
>     echo "pgpgin: $((final_values[2] - initial_values[2]))"
>     echo "pgpgout: $((final_values[3] - initial_values[3]))"
>     echo "swpout_zero: $((final_values[4] - initial_values[4]))"
>     echo "swpin_zero: $((final_values[5] - initial_values[5]))"
>     echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
> done
>
> The results I am seeing are as follows:
>
> real    6m43.998s
> user    47m3.800s
> sys     5m7.169s
> pswpin: 342041
> pswpout: 1470846
> pgpgin: 11744932
> pgpgout: 14466564
> swpout_zero: 318030
> swpin_zero: 93621
> swpin_zero_read: 13118
>
> The proportion of zero_swpout is quite large (>10%): 318,030 vs. 1,470,846.
> The percentage is 17.8% = 318,030 / (318,030 + 1,470,846).
>
> About 29.4% (93,621 / 318,030) of these will be swapped in, and 14% of those
> zero_swpin pages are read (13,118 / 93,621).
>
> Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
> will be re-mapped to zero_pfn, potentially saving up to 0.73% RSS in this
> kernel-build workload. Thus, the total build time of my final results falls

Apologies for the mistake in my math. I shouldn't have used swpout as the
denominator and swpin as the numerator. Instead, both the numerator and the
denominator should be based on swpin: potentially, 13,118 swpin_zero_read /
(342,041 pswpin + 13,118 swpin_zero_read) of swap-ins could be avoided,
meaning 3.7% of all swap-ins can be saved using zero_pfn.

> within the testing jitter range, showing no noticeable difference, while
> the conceptual model code with lots of zero-filled pages and read swap-in
> shows significant differences.

Although 3.7% of swap-ins can be saved, my x86 PC is too weak to demonstrate
the difference, as numerous factors, such as temperature and unstable I/O
latency, can impact the final build time. Hopefully, others can share test
results conducted on stable hardware and with more workloads.

> I'm not sure if we can identify another real workload with more read swpin
> to observe noticeable improvements. Perhaps Usama has some?

Thanks
Barry
End of thread, other threads: [~2024-12-13 2:27 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-12  7:37 [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in Barry Song
2024-12-12  8:29 ` Christoph Hellwig
2024-12-12  8:46   ` Barry Song
2024-12-12  8:50     ` Christoph Hellwig
2024-12-12  8:54       ` Barry Song
2024-12-12  8:50   ` David Hildenbrand
2024-12-12  9:16     ` Barry Song
2024-12-12 16:25       ` Johannes Weiner
2024-12-13  1:47         ` Barry Song
2024-12-13  2:27           ` Barry Song