Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

From: Byungchul Park <byungchul@sk.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel_team@skhynix.com, akpm@linux-foundation.org,
	ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com,
	mgorman@techsingularity.net, hughd@google.com,
	willy@infradead.org, peterz@infradead.org, luto@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
Date: Mon, 30 Oct 2023 18:58:03 +0900
Message-ID: <20231030095803.GA81877@system.software.com>
In-Reply-To: <a8337371-50ed-4618-b48e-78b96d18810f@redhat.com>

On Mon, Oct 30, 2023 at 09:00:56AM +0100, David Hildenbrand wrote:
> On 30.10.23 08:25, Byungchul Park wrote:
> > Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
> > We always face migration overhead at either promotion or demotion
> > while working with tiered memory, e.g. CXL memory, and found that TLB
> > shootdown is a large part of it that is worth eliminating if possible.
> > 
> > Fortunately, the TLB flush can be deferred or even skipped if both the
> > source and destination folios of a migration are kept until all
> > required TLB flushes have been done. This only holds if the target PTE
> > entries are read-only or, more precisely, do not have write
> > permission. Otherwise, the folio might get corrupted.
> > 
> > To achieve that:
> > 
> >     1. For the folios that map only to non-writable TLB entries, prevent
> >        TLB flush at migration by keeping both source and destination
> >        folios, which will be handled later at a better time.
> > 
> >     2. When any non-writable TLB entry changes to writable e.g. through
> >        fault handler, give up CONFIG_MIGRC mechanism so as to perform
> >        TLB flush required right away.
> > 
> >     3. Temporarily stop migrc from working when the system is under very
> >        high memory pressure, e.g. when direct reclaim is needed.
> > 
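The three rules above can be pictured with a small userspace toy model. This is only a sketch of the decision logic as described in the text, not code from the patch: every name in it (toy_folio, toy_system, migrate_decide(), make_writable()) is hypothetical, and the real implementation works on folios, PTEs and per-CPU TLB state rather than on these simplified fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from the patch. */
enum flush_action {
	FLUSH_NOW,	/* flush the TLB as part of the migration */
	FLUSH_DEFER,	/* keep src and dst folios, flush later */
};

struct toy_folio {
	int nr_writable_maps;	/* mappings whose PTE allows writes */
	bool flush_pending;	/* a deferred flush still covers this folio */
};

struct toy_system {
	bool high_pressure;	/* e.g. direct reclaim is needed */
};

/*
 * Rules 1 and 3: defer the flush only for a folio that has no writable
 * mapping, and only while the system is not under high memory pressure.
 */
enum flush_action migrate_decide(const struct toy_system *sys,
				 struct toy_folio *f)
{
	if (sys->high_pressure || f->nr_writable_maps > 0)
		return FLUSH_NOW;

	f->flush_pending = true;
	return FLUSH_DEFER;
}

/*
 * Rule 2: once a mapping becomes writable, a pending deferred flush
 * must be performed right away. Returns true if that was the case.
 */
bool make_writable(struct toy_folio *f)
{
	bool flushed = f->flush_pending;

	f->nr_writable_maps++;
	f->flush_pending = false;
	return flushed;
}
```

In this model a read-only folio migrated on an idle system yields FLUSH_DEFER, and a later make_writable() on it reports that the deferred flush had to be done immediately.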
> > The measurement result:
> > 
> >     Architecture - x86_64
> >     QEMU - kvm enabled, host cpu
> >     Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> >     Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
> >     Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
> > 
> >     run 'perf stat' using events:
> >        1) itlb.itlb_flush
> >        2) tlb_flush.dtlb_thread
> >        3) tlb_flush.stlb_any
> >        4) dTLB-load-misses
> >        5) dTLB-store-misses
> >        6) iTLB-load-misses
> > 
> >     run 'cat /proc/vmstat' and pick:
> >        1) numa_pages_migrated
> >        2) pgmigrate_success
> >        3) nr_tlb_remote_flush
> >        4) nr_tlb_remote_flush_received
> >        5) nr_tlb_local_flush_all
> >        6) nr_tlb_local_flush_one
> > 
> >     BEFORE - mainline v6.6-rc5
> >     ------------------------------------------
> >     $ perf stat -a \
> > 	   -e itlb.itlb_flush \
> > 	   -e tlb_flush.dtlb_thread \
> > 	   -e tlb_flush.stlb_any \
> > 	   -e dTLB-load-misses \
> > 	   -e dTLB-store-misses \
> > 	   -e iTLB-load-misses \
> > 	   ./XSBench -p 50000000
> > 
> >     Performance counter stats for 'system wide':
> > 
> >        20953405     itlb.itlb_flush
> >        114886593    tlb_flush.dtlb_thread
> >        88267015     tlb_flush.stlb_any
> >        115304095543 dTLB-load-misses
> >        163904743    dTLB-store-misses
> >        608486259    iTLB-load-misses
> > 
> >     556.787113849 seconds time elapsed
> > 
> >     $ cat /proc/vmstat
> > 
> >     ...
> >     numa_pages_migrated 3378748
> >     pgmigrate_success 7720310
> >     nr_tlb_remote_flush 751464
> >     nr_tlb_remote_flush_received 10742115
> >     nr_tlb_local_flush_all 21899
> >     nr_tlb_local_flush_one 740157
> >     ...
> > 
> >     AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> >     ------------------------------------------
> >     $ perf stat -a \
> > 	   -e itlb.itlb_flush \
> > 	   -e tlb_flush.dtlb_thread \
> > 	   -e tlb_flush.stlb_any \
> > 	   -e dTLB-load-misses \
> > 	   -e dTLB-store-misses \
> > 	   -e iTLB-load-misses \
> > 	   ./XSBench -p 50000000
> > 
> >     Performance counter stats for 'system wide':
> > 
> >        4353555      itlb.itlb_flush
> >        72482780     tlb_flush.dtlb_thread
> >        68226458     tlb_flush.stlb_any
> >        114331610808 dTLB-load-misses
> >        116084771    dTLB-store-misses
> >        377180518    iTLB-load-misses
> > 
> >     552.667718220 seconds time elapsed
> > 
> >     $ cat /proc/vmstat
> > 
> 
> So, an improvement of 0.74% ? How stable are the results? Serious question:

I'm getting very stable results.

> worth the churn?

Yes, ultimately an improvement in elapsed time should be observed. However,
I've been focusing on the number of TLB flushes and TLB misses, because
how much the total time improves depends on the test conditions. We should
see a larger gain if we test with a system that:

   1. has more CPUs, which would induce a huge number of IPIs.
   2. has slow memory, which makes the TLB miss overhead bigger.
   3. runs workloads that suffer from TLB misses and IPI storms.
   4. runs workloads that cause heavier NUMA migration.
   5. runs workloads that have a lot of read-only mappings.
   6. and so on.

I will share the results once I manage to meet the conditions.

By the way, I should have included the IPI reduction as well, because it
also shows a very large delta :)
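
To make the deltas easier to compare, here is a minimal C sketch that computes the relative change for the numbers quoted above. The helper names rel_change_pct() and sample_change_pct() and the table layout are made up for illustration; only the counter values are taken from the perf stat output in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Relative change from 'before' to 'after' in percent; negative means a drop. */
double rel_change_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

struct sample {
	const char *name;
	double before;	/* mainline v6.6-rc5 */
	double after;	/* mainline v6.6-rc5 + CONFIG_MIGRC */
};

/* Values copied from the quoted perf stat output. */
const struct sample samples[] = {
	{ "itlb.itlb_flush",        20953405.0,      4353555.0 },
	{ "tlb_flush.dtlb_thread", 114886593.0,     72482780.0 },
	{ "tlb_flush.stlb_any",     88267015.0,     68226458.0 },
	{ "dTLB-load-misses",   115304095543.0, 114331610808.0 },
	{ "dTLB-store-misses",     163904743.0,    116084771.0 },
	{ "iTLB-load-misses",      608486259.0,    377180518.0 },
	{ "seconds elapsed",   556.787113849,  552.667718220 },
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Look up a row by name; returns 0.0 if the name is not in the table. */
double sample_change_pct(const char *name)
{
	size_t i;

	for (i = 0; i < NR_SAMPLES; i++)
		if (strcmp(samples[i].name, name) == 0)
			return rel_change_pct(samples[i].before,
					      samples[i].after);
	return 0.0;
}
```

For example, sample_change_pct("seconds elapsed") comes out at about -0.74, which is the figure David derived from the elapsed times, while the flush counters drop by a much larger fraction.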

> Or did I get the numbers wrong?
> 
> >   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5c02720c53a5..1ca2ac91aa14 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,9 @@ enum pageflags {
> >   #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> >   	PG_arch_2,
> >   	PG_arch_3,
> > +#endif
> > +#ifdef CONFIG_MIGRC
> > +	PG_migrc,		/* Page has its copy under migrc's control */
> >   #endif
> >   	__NR_PAGEFLAGS,
> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> >   PAGEFLAG(Idle, idle, PF_ANY)
> >   #endif
> > +#ifdef CONFIG_MIGRC
> > +PAGEFLAG(Migrc, migrc, PF_ANY)
> > +#endif
> 
> I assume you know this: new pageflags are frowned upon.

Sorry for that. I really didn't want to add a new headache.
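
For readers less familiar with page flags, the following is a rough userspace sketch of what a PAGEFLAG-style declaration provides: one bit in a per-page flags word plus generated test/set/clear helpers. The toy_* names are hypothetical, and the helpers here are plain non-atomic bit operations, unlike the kernel's. As far as I understand, the concern with new flags is that the bits in that word are a limited resource shared with other fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the flags word is modelled. */
struct toy_page {
	unsigned long flags;
};

/* Each enumerator names one bit position in toy_page.flags. */
enum toy_pageflags {
	TOY_PG_locked,
	TOY_PG_migrc,	/* page has its copy under migrc's control */
	__TOY_NR_PAGEFLAGS,
};

/*
 * Generate test/set/clear helpers for one flag, loosely mirroring what
 * a PAGEFLAG() line expands to. Not atomic; for illustration only.
 */
#define TOY_PAGEFLAG(lname)						\
bool toy_test_##lname(const struct toy_page *p)			\
{									\
	return p->flags & (1UL << TOY_PG_##lname);			\
}									\
void toy_set_##lname(struct toy_page *p)				\
{									\
	p->flags |= 1UL << TOY_PG_##lname;				\
}									\
void toy_clear_##lname(struct toy_page *p)				\
{									\
	p->flags &= ~(1UL << TOY_PG_##lname);				\
}

TOY_PAGEFLAG(locked)
TOY_PAGEFLAG(migrc)
```

With this, toy_set_migrc(&page) marks a page and toy_test_migrc(&page) checks the mark, which is the kind of accessor the quoted PAGEFLAG(Migrc, migrc, PF_ANY) line adds for the new flag.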

	Byungchul
