linux-mm.kvack.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: hannes@cmpxchg.org
Cc: 21cnbao@gmail.com, akpm@linux-foundation.org, axboe@kernel.dk,
	bala.seshasayee@linux.intel.com, baolin.wang@linux.alibaba.com,
	chrisl@kernel.org, david@redhat.com, hch@infradead.org,
	kanchana.p.sridhar@intel.com, kasong@tencent.com,
	linux-mm@kvack.org, nphamcs@gmail.com, ryan.roberts@arm.com,
	senozhatsky@chromium.org, terrelln@fb.com,
	usamaarif642@gmail.com, v-songbaohua@oppo.com,
	wajdi.k.feghali@intel.com, willy@infradead.org,
	ying.huang@linux.alibaba.com, yosryahmed@google.com
Subject: Re: [PATCH RFC] mm: map zero-filled pages to zero_pfn while doing swap-in
Date: Fri, 13 Dec 2024 14:47:44 +1300	[thread overview]
Message-ID: <20241213014744.45296-1-21cnbao@gmail.com> (raw)
In-Reply-To: <20241212162508.GA4712@cmpxchg.org>

On Fri, Dec 13, 2024 at 5:25 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 12, 2024 at 10:16:22PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 12.12.24 09:46, Barry Song wrote:
> > > > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >>
> > > >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > >>> From: Barry Song <v-songbaohua@oppo.com>
> > > >>>
> > > >>> While developing the zeromap series, Usama observed that certain
> > > >>> workloads may contain over 10% zero-filled pages. This may present
> > > >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> > > >>> allocate a new page using the Copy-on-Write mechanism.
> > > >>
> > > >> Shouldn't this be done during, or rather instead of, swap-out?
> > > >> Swapping all zero pages out just to optimize the in-memory
> > > >> representation seems rather backwards.
> > > >
> > > > I’m having trouble understanding your point; it seems like you might
> > > > not have fully read the code. :-)
> > > >
> > > > The situation is as follows: for a zero-filled page, we currently
> > > > allocate a new page unconditionally at swap-in. By mapping this
> > > > zero-filled page to zero_pfn, we could save the memory used by this
> > > > page.
> > > >
> > > > We don't need to allocate the memory until the page is written
> > > > (which may never happen).
> > >
> > > I think what Christoph means is that you would determine that at PTE
> > > unmap time, and directly place the zero page in there. So there would be
> > > no need to have the page fault at all.
> > >
> > > I suspect doing it at PTE unmap time might be problematic, because we
> > > might still have other (e.g., GUP) references modifying that page, and
> > > we can only rely on the page content being stable after we have flushed
> > > the TLB as well. (I recall some deferred flushing optimizations.)
> >
> > Yes, we need to follow a strict sequence:
> >
> > 1. try_to_unmap - unmap PTEs in all processes;
> > 2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
> > 3. pageout - zeromap will set a bit in the bitmap if the page is zero-filled
> >
> > At the moment of pageout(), we can be confident that the page is zero-filled.
> >
> > Mapping to the zero page during unmap seems quite risky.
>
> You have to unmap and flush to stop modifications, but I think not in
> all processes before it's safe to decide. Shared anon pages have COW
> semantics; when you enter try_to_unmap() with a page and rmap gives
> you a pte, it's one of these:
>
>   a) never forked, no sibling ptes
>   b) cow broken into private copy, no sibling ptes
>   c) cow/WP; any writes to this or another pte will go to a new page.
>
> In cases a and b you need to unmap and flush the current pte, but then
> it's safe to check contents and set the zero pte right away, even
> before finishing the rmap walk.
>
> In case c, modifications to the page are impossible due to WP, so you
> don't even need to unmap and flush before checking the contents. The
> pte lock holds up COW breaking to a new page until you're done.
>
> It's definitely more complicated than the current implementation, but
> if it can be made to work, we could get rid of the bitmap.
>
> You might also reduce faults, but I'm a bit skeptical. Presumably
> zero-filled regions are mostly considered invalid by the application,
> not useful data, so a populating write that will cow-break seems more
> likely to happen next than a faultless read from the zeropage.

Yes. That is right.
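
If the unmap-time route can be made to work, I imagine the core check
would look very roughly like the untested sketch below. try_map_zero_pte()
is a hypothetical helper, assumed to be called from try_to_unmap_one()
with the PTE lock held, after the old PTE has been cleared and the TLB
flushed for cases a/b, or with the PTE still write-protected for case c:

/*
 * Rough, untested sketch only -- not the actual patch. Assumes we
 * are under the PTE lock in try_to_unmap_one(), that the old PTE
 * has been cleared and the TLB flushed (cases a/b above) or the PTE
 * is still write-protected (case c), so the page contents are
 * stable, and that the page is not in highmem.
 */
static bool try_map_zero_pte(struct vm_area_struct *vma,
			     unsigned long addr, pte_t *ptep,
			     struct page *page)
{
	if (memchr_inv(page_address(page), 0, PAGE_SIZE))
		return false;	/* not zero-filled, swap out as usual */

	/*
	 * Map the shared zero page read-only, mirroring what
	 * do_anonymous_page() does for read faults; a later write
	 * cow-breaks into a fresh page via do_wp_page().
	 */
	set_pte_at(vma->vm_mm, addr, ptep,
		   pte_mkspecial(pfn_pte(my_zero_pfn(addr),
					 vma->vm_page_prot)));
	/* real code would also fix up rmap and the MM counters here */
	return true;
}

If that can be made to work, such pages would need neither a swap slot
nor a zeromap bit, as you said.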

I created the debug patch below to count what proportion of zero-page
swap-ins are read faults, and to compare zero-page swap-ins (swpin_zero)
against zero-page swap-outs (swpout_zero):

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..ed9d1a6cc565 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -136,6 +136,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		SWAP_RA_HIT,
 		SWPIN_ZERO,
 		SWPOUT_ZERO,
+		SWPIN_ZERO_READ,
 #ifdef CONFIG_KSM
 		KSM_SWPIN_COPY,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index f3040c69f648..3aacfbe7bd77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4400,6 +4400,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                                /* Count SWPIN_ZERO since page_io was skipped */
                                objcg = get_obj_cgroup_from_swap(entry);
                                count_vm_events(SWPIN_ZERO, 1);
+                               count_vm_events(SWPIN_ZERO_READ, 1);
                                if (objcg) {
                                        count_objcg_events(objcg, SWPIN_ZERO, 1);
                                        obj_cgroup_put(objcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..9465fe9bda9e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1420,6 +1420,7 @@ const char * const vmstat_text[] = {
 	"swap_ra_hit",
 	"swpin_zero",
 	"swpout_zero",
+	"swpin_zero_read",
 #ifdef CONFIG_KSM
 	"ksm_swpin_copy",
 #endif


For a kernel-build workload in a single memcg capped at 1GB of memory, I
used the script below:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    # Match the first field exactly: a plain grep for "swpin_zero"
    # would also match the new "swpin_zero_read" line.
    pswpin=$(awk '$1 == "pswpin" {print $2}' $vmstat_path)
    pswpout=$(awk '$1 == "pswpout" {print $2}' $vmstat_path)
    pgpgin=$(awk '$1 == "pgpgin" {print $2}' $vmstat_path)
    pgpgout=$(awk '$1 == "pgpgout" {print $2}' $vmstat_path)
    swpout_zero=$(awk '$1 == "swpout_zero" {print $2}' $vmstat_path)
    swpin_zero=$(awk '$1 == "swpin_zero" {print $2}' $vmstat_path)
    swpin_zero_read=$(awk '$1 == "swpin_zero_read" {print $2}' $vmstat_path)

    echo "$pswpin $pswpout $pgpgin $pgpgout $swpout_zero $swpin_zero $swpin_zero_read"
}

for ((i=1; i<=5; i++))
do
  echo
  echo "*** Executing round $i ***"
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
  sync; echo 3 > /proc/sys/vm/drop_caches; sleep 1
  #kernel build
  initial_values=($(read_values))
  time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
  final_values=($(read_values))

  echo "pswpin: $((final_values[0] - initial_values[0]))"
  echo "pswpout: $((final_values[1] - initial_values[1]))"
  echo "pgpgin: $((final_values[2] - initial_values[2]))"
  echo "pgpgout: $((final_values[3] - initial_values[3]))"
  echo "swpout_zero: $((final_values[4] - initial_values[4]))"
  echo "swpin_zero: $((final_values[5] - initial_values[5]))"
  echo "swpin_zero_read: $((final_values[6] - initial_values[6]))"
done


The results I am seeing are as follows:

real	6m43.998s
user	47m3.800s
sys	5m7.169s
pswpin: 342041
pswpout: 1470846
pgpgin: 11744932
pgpgout: 14466564
swpout_zero: 318030
swpin_zero: 93621
swpin_zero_read: 13118

The proportion of zero-filled pages among swap-outs is quite large (> 10%):
17.8% = 318,030 / (318,030 + 1,470,846).

About 29.4% (93,621 / 318,030) of those zero-filled pages are later swapped
back in, and only 14.0% (13,118 / 93,621) of those zero_swpin events are
read faults.

Therefore, a total of 17.8% * 29.4% * 14% = 0.73% of all swapped-out pages
end up re-mapped to zero_pfn, potentially saving up to 0.73% of RSS in this
kernel-build workload. Accordingly, the total build time in my final results
falls within the testing jitter range and shows no noticeable difference,
whereas a synthetic test with lots of zero-filled pages and read-dominated
swap-ins shows significant differences.
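
For reference, the same arithmetic as a tiny standalone check (plain C,
numbers copied from the run above):

#include <stdio.h>

int main(void)
{
	double pswpout = 1470846, swpout_zero = 318030;
	double swpin_zero = 93621, swpin_zero_read = 13118;

	/* share of zero-filled pages among all swap-outs: ~17.8% */
	double zero_out = swpout_zero / (swpout_zero + pswpout);
	/* share of those later swapped back in: ~29.4% */
	double zero_in = swpin_zero / swpout_zero;
	/* share of those swap-ins that are read faults: ~14% */
	double zero_read = swpin_zero_read / swpin_zero;

	/* prints 0.73 */
	printf("%.2f%% of swapped-out pages end up on zero_pfn\n",
	       100.0 * zero_out * zero_in * zero_read);
	return 0;
}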

I'm not sure whether we can identify another real workload with more read
swap-ins where we would observe a noticeable improvement. Perhaps Usama has
some?

Thanks
Barry



Thread overview: 10+ messages
2024-12-12  7:37 Barry Song
2024-12-12  8:29 ` Christoph Hellwig
2024-12-12  8:46   ` Barry Song
2024-12-12  8:50     ` Christoph Hellwig
2024-12-12  8:54       ` Barry Song
2024-12-12  8:50     ` David Hildenbrand
2024-12-12  9:16       ` Barry Song
2024-12-12 16:25         ` Johannes Weiner
2024-12-13  1:47           ` Barry Song [this message]
2024-12-13  2:27             ` Barry Song
