From: Joe Damato <jdamato@fastly.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: x86@kernel.org, linux-mm@kvack.org,
Dave Hansen <dave.hansen@linux.intel.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
"H. Peter Anvin" <hpa@zytor.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC 1/1] mm: Add per-task struct tlb counters
Date: Wed, 14 Sep 2022 07:15:08 -0700
Message-ID: <20220914141507.GA4422@fastly.com>
In-Reply-To: <e0067441-19e2-2ae6-df47-2018672426be@intel.com>

On Wed, Sep 14, 2022 at 12:40:55AM -0700, Dave Hansen wrote:
> On 9/13/22 18:51, Joe Damato wrote:
> > TLB shootdowns are tracked globally, but on a busy system it can be
> > difficult to disambiguate the source of TLB shootdowns.
> >
> > Add two counter fields:
> > - nrtlbflush: number of tlb flush events received
> > - ngtlbflush: number of tlb flush events generated
> >
> > Expose those fields in /proc/[pid]/stat so that they can be analyzed
> > alongside similar metrics (e.g. min_flt and maj_flt).
>
> On x86 at least, we already have two other ways to count flushes. You
> even quoted them with your patch:
>
> >  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> > +	current->ngtlbflush++;
> >  	if (info->end == TLB_FLUSH_ALL)
> >  		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
>
> Granted, the count_vm_tlb...() one is debugging only. But, did you try
> to use those other mechanisms? For instance, could you patch
> count_vm_tlb_event()?

I tried to address this in my cover letter[1]: the count_vm_tlb_event()
counters are system-wide, AFAICT. That is certainly useful, but without
finer granularity it's difficult to know how many TLB shootdowns each task
is generating. The goal was to account for these events on a per-task
basis.

I could patch count_vm_tlb_event() to account on a per-task basis. That
seems reasonable to me... assuming you and others are convinced that it's a
better approach than tracepoints ;)

> Why didn't the tracepoints work for you?

Tracepoints do work, but IMHO the trouble with tracepoints in this case is:

  - You need to actually be running perf to gather the data at the right
    time. If you stop running perf too soon, or if the TLB shootdown storm
    is caused by some anomalous event while you weren't running perf, you
    are out of luck.

  - On heavily loaded systems with O(10,000) or O(100,000) tasks, perf
    tracepoint data is hard to analyze, events can be dropped, and
    significant resources can be consumed.

In addition to this, there is existing tooling on Linux for scraping
/proc/[pid]/stat for graphing/analysis/etc.

IMO, an easier way to debug large numbers of TLB shootdowns on a system
might be (using a form of this patch):

  1. Examine /proc/[pid]/stat to see which process or processes are
     responsible for the majority of the shootdowns. Perhaps you have a
     script scraping this data at various intervals and recording deltas.

  2. Now that you know the timeline of the events, which processes are
     responsible, and the magnitude of the deltas, perf tracepoints can
     help you determine exactly when and where they occur.
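
A minimal sketch of the scraping in step 1, offered as an illustration only.
The helper names (parse_stat_line, field, snapshot, deltas) are hypothetical.
The position of the proposed TLB counters in /proc/[pid]/stat is not stated
in this thread, so the sketch takes field numbers as a parameter. On an
unpatched kernel it can be tried with minflt (field 10) and majflt (field 12)
as documented in proc(5).

```python
import os


def parse_stat_line(line):
    """Split one /proc/[pid]/stat line into (pid, comm, remaining fields)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the LAST ')' instead of splitting on whitespace.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()
    return pid, comm, rest


def field(rest, number):
    """Return field `number` (1-based, as numbered in proc(5)) as an int.

    Fields 1 and 2 are pid and comm, so rest[0] holds field 3.
    """
    return int(rest[number - 3])


def snapshot(field_numbers, proc_root="/proc"):
    """Read the requested numeric fields for every visible process."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as f:
                pid, comm, rest = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the task exited or is not readable; skip it
        snap[pid] = (comm, tuple(field(rest, n) for n in field_numbers))
    return snap


def deltas(before, after):
    """Per-pid counter differences for processes present in both snapshots."""
    out = {}
    for pid, (comm, new) in after.items():
        # Require the same comm as a cheap guard against pid reuse.
        if pid in before and before[pid][0] == comm:
            old = before[pid][1]
            out[pid] = (comm, tuple(n - o for n, o in zip(new, old)))
    return out
```

Taking `snapshot((10, 12))` twice, some interval apart, and passing both
results to `deltas()` gives per-process fault deltas. With a patched kernel,
the field numbers of the new counters (an assumption, to be read from the
patch itself) would be passed instead.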

What do you think?

> Can this be done in a more arch-generic way? It's a shame to
> unconditionally add counters to the task struct and only use them on
> x86. If someone wanted to generalize the x86 tracepoints, or make them
> available to other architectures, I think that would be fine even if
> they have to change a bit (queue the inevitable argument about
> tracepoint ABI).

I'm not sure. If I tweaked count_vm_tlb_event() instead, then I suppose any
archs other than x86 that support it in the future would automatically get
support for this.

> P.S. I'm not a fan of the structure member naming.

Fair enough; I was inspired by nvcsw and nivcsw :) but if you think that
this is worth pursuing, I'll use clearer names in the future.

Thanks for taking a look!