From: Byungchul Park <byungchul@sk.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kernel_team@skhynix.com, akpm@linux-foundation.org,
ying.huang@intel.com, namit@vmware.com, xhao@linux.alibaba.com,
mgorman@techsingularity.net, hughd@google.com,
willy@infradead.org, peterz@infradead.org, luto@kernel.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com
Subject: Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration
Date: Mon, 30 Oct 2023 18:58:03 +0900
Message-ID: <20231030095803.GA81877@system.software.com>
In-Reply-To: <a8337371-50ed-4618-b48e-78b96d18810f@redhat.com>
On Mon, Oct 30, 2023 at 09:00:56AM +0100, David Hildenbrand wrote:
> On 30.10.23 08:25, Byungchul Park wrote:
> > Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
> > We always face the migration overhead at either promotion or demotion,
> > while working with tiered memory e.g. CXL memory and found out TLB
> > shootdown is a quite big one that is needed to get rid of if possible.
> >
> > Fortunately, TLB flush can be deferred or even skipped if both source and
> > destination of folios during migration are kept until all TLB flushes
> > required will have been done, of course, only if the target PTE entries
> > have read only permission, more precisely speaking, don't have write
> > permission. Otherwise, no doubt the folio might get messed up.
> >
> > To achieve that:
> >
> > 1. For the folios that map only to non-writable TLB entries, prevent
> > TLB flush at migration by keeping both source and destination
> > folios, which will be handled later at a better time.
> >
> > 2. When any non-writable TLB entry changes to writable e.g. through
> > fault handler, give up CONFIG_MIGRC mechanism so as to perform
> > TLB flush required right away.
> >
> > 3. Temporarily stop migrc from working when the system is in very
> > high memory pressure e.g. direct reclaim needed.
> >
> > The measurement result:
> >
> > Architecture - x86_64
> > QEMU - kvm enabled, host cpu
> > Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> > Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
> > Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
> >
> > run 'perf stat' using events:
> > 1) itlb.itlb_flush
> > 2) tlb_flush.dtlb_thread
> > 3) tlb_flush.stlb_any
> > 4) dTLB-load-misses
> > 5) dTLB-store-misses
> > 6) iTLB-load-misses
> >
> > run 'cat /proc/vmstat' and pick:
> > 1) numa_pages_migrated
> > 2) pgmigrate_success
> > 3) nr_tlb_remote_flush
> > 4) nr_tlb_remote_flush_received
> > 5) nr_tlb_local_flush_all
> > 6) nr_tlb_local_flush_one
> >
> > BEFORE - mainline v6.6-rc5
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 20953405 itlb.itlb_flush
> > 114886593 tlb_flush.dtlb_thread
> > 88267015 tlb_flush.stlb_any
> > 115304095543 dTLB-load-misses
> > 163904743 dTLB-store-misses
> > 608486259 iTLB-load-misses
> >
> > 556.787113849 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
> > ...
> > numa_pages_migrated 3378748
> > pgmigrate_success 7720310
> > nr_tlb_remote_flush 751464
> > nr_tlb_remote_flush_received 10742115
> > nr_tlb_local_flush_all 21899
> > nr_tlb_local_flush_one 740157
> > ...
> >
> > AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 4353555 itlb.itlb_flush
> > 72482780 tlb_flush.dtlb_thread
> > 68226458 tlb_flush.stlb_any
> > 114331610808 dTLB-load-misses
> > 116084771 dTLB-store-misses
> > 377180518 iTLB-load-misses
> >
> > 552.667718220 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
>
> So, an improvement of 0.74% ? How stable are the results? Serious question:
I'm getting very stable results.
> worth the churn?
Yes, ultimately an improvement in elapsed time should be observed. However,
I've been focusing on the number of TLB flushes and TLB misses, because a
better total time will follow from those depending on the test
conditions. We should see that result if we test on a system that:

1. has more CPUs, which would induce a crazy number of IPIs.
2. has slower memory, which makes the TLB miss overhead bigger.
3. runs workloads that suffer from TLB misses and IPI storms.
4. runs workloads that cause heavier NUMA migration.
5. runs workloads that have a lot of read-only mappings.
6. and so on.

I will share the results once I manage to meet those conditions.

By the way, I should've added the IPI reduction, because it also shows a
super big delta :)
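
For reference, the relative deltas between the BEFORE and AFTER runs can be
recomputed from the perf stat output quoted above. The script below is only a
sketch: the counter values are copied from this thread, while the dictionary
layout and the helper name reduction_pct are made up for illustration and are
not part of the patch set.

```python
# Relative change between the BEFORE and AFTER runs quoted in this thread.
# Values are copied from the perf stat output above; the helper name and
# the table layout are illustrative assumptions, not part of the patch.

BEFORE = {
    "itlb.itlb_flush": 20953405,
    "tlb_flush.dtlb_thread": 114886593,
    "tlb_flush.stlb_any": 88267015,
    "dTLB-load-misses": 115304095543,
    "dTLB-store-misses": 163904743,
    "iTLB-load-misses": 608486259,
    "seconds time elapsed": 556.787113849,
}

AFTER = {
    "itlb.itlb_flush": 4353555,
    "tlb_flush.dtlb_thread": 72482780,
    "tlb_flush.stlb_any": 68226458,
    "dTLB-load-misses": 114331610808,
    "dTLB-store-misses": 116084771,
    "iTLB-load-misses": 377180518,
    "seconds time elapsed": 552.667718220,
}


def reduction_pct(before, after):
    """Return how much `after` dropped relative to `before`, in percent."""
    return (before - after) / before * 100.0


def main():
    # Print one line per counter, e.g. the drop in itlb.itlb_flush.
    for name, before in BEFORE.items():
        pct = reduction_pct(before, AFTER[name])
        print(f"{name:24s} {pct:6.2f}% lower")


if __name__ == "__main__":
    main()
```

The "seconds time elapsed" row is the 0.74% figure asked about above; the
other rows are the flush and miss counters this reply focuses on.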
> Or did I get the numbers wrong?
>
> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5c02720c53a5..1ca2ac91aa14 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,9 @@ enum pageflags {
> > #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > PG_arch_2,
> > PG_arch_3,
> > +#endif
> > +#ifdef CONFIG_MIGRC
> > + PG_migrc, /* Page has its copy under migrc's control */
> > #endif
> > __NR_PAGEFLAGS,
> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> > PAGEFLAG(Idle, idle, PF_ANY)
> > #endif
> > +#ifdef CONFIG_MIGRC
> > +PAGEFLAG(Migrc, migrc, PF_ANY)
> > +#endif
>
> I assume you know this: new pageflags are frowned upon.
Sorry for that. I really didn't want to add a new headache.
Byungchul