* [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
2009-08-08 5:47 ` KOSAKI Motohiro
2009-08-07 17:40 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
` (4 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs. Using the events, it will be known what
struct page and pfns are being allocated and freed and what the call site
was in many cases.
The page alloc tracepoints be used as an indicator as to whether the workload
was heavily dependant on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and seg-addr options
to get a proper stack trace.
The pagevec free tracepoint has a different usecase. It can be used to get
a idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/trace/events/kmem.h | 74 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 7 +++-
2 files changed, 80 insertions(+), 1 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..0d358a0 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,80 @@ TRACE_EVENT(kmem_cache_free,
TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
);
+
+TRACE_EVENT(mm_page_free_direct,
+
+ TP_PROTO(struct page *page, unsigned int order),
+
+ TP_ARGS(page, order),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( unsigned int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+ TP_PROTO(struct page *page, int cold),
+
+ TP_ARGS(page, cold),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( int, cold )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->cold = cold;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=0 cold=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+ TP_PROTO(struct page *page, unsigned int order,
+ gfp_t gfp_flags, int migratetype),
+
+ TP_ARGS(page, order, gfp_flags, migratetype),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( unsigned int, order )
+ __field( gfp_t, gfp_flags )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ __entry->gfp_flags = gfp_flags;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->order,
+ __entry->migratetype,
+ show_gfp_flags(__entry->gfp_flags))
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d052abb..440c439 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@ static void free_hot_cold_page(struct page *page, int cold)
void free_hot_page(struct page *page)
{
+ trace_mm_page_free_direct(page, 0);
free_hot_cold_page(page, 0);
}
@@ -1905,6 +1906,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ trace_mm_page_alloc(page, order, gfp_mask, migratetype);
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1945,13 +1947,16 @@ void __pagevec_free(struct pagevec *pvec)
{
int i = pagevec_count(pvec);
- while (--i >= 0)
+ while (--i >= 0) {
+ trace_mm_pagevec_free(pvec->pages[i], pvec->cold);
free_hot_cold_page(pvec->pages[i], pvec->cold);
+ }
}
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
+ trace_mm_page_free_direct(page, order);
if (order == 0)
free_hot_page(page);
else
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-08-08 5:47 ` KOSAKI Motohiro
0 siblings, 0 replies; 11+ messages in thread
From: KOSAKI Motohiro @ 2009-08-08 5:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
LKML, linux-mm
2009/8/8 Mel Gorman <mel@csn.ul.ie>:
> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs. Using the events, it will be known what
> struct page and pfns are being allocated and freed and what the call site
> was in many cases.
>
> The page alloc tracepoints be used as an indicator as to whether the workload
> was heavily dependant on the page allocator or not. You can make a guess based
> on vmstat but you can't get a per-process breakdown. Depending on the call
> path, the call_site for page allocation may be __get_free_pages() instead
> of a useful callsite. Instead of passing down a return address similar to
> slab debugging, the user should enable the stacktrace and seg-addr options
> to get a proper stack trace.
>
> The pagevec free tracepoint has a different usecase. It can be used to get
> a idea of how many pages are being dumped off the LRU and whether it is
> kswapd doing the work or a process doing direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Looks good to me.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
2009-08-07 17:40 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances change the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely that fragmentation
will be a problem later for hugepage allocation at least but there are
other considerations such as the order of page being split to satisfy
the allocation.
This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/trace/events/kmem.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 4 ++++
2 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0d358a0..aae16ee 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,44 @@ TRACE_EVENT(mm_page_alloc,
show_gfp_flags(__entry->gfp_flags))
);
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+ TP_PROTO(struct page *page,
+ int alloc_order, int fallback_order,
+ int alloc_migratetype, int fallback_migratetype),
+
+ TP_ARGS(page,
+ alloc_order, fallback_order,
+ alloc_migratetype, fallback_migratetype),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( int, alloc_order )
+ __field( int, fallback_order )
+ __field( int, alloc_migratetype )
+ __field( int, fallback_migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->alloc_order = alloc_order;
+ __entry->fallback_order = fallback_order;
+ __entry->alloc_migratetype = alloc_migratetype;
+ __entry->fallback_migratetype = fallback_migratetype;
+ ),
+
+ TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->alloc_order,
+ __entry->fallback_order,
+ pageblock_order,
+ __entry->alloc_migratetype,
+ __entry->fallback_migratetype,
+ __entry->fallback_order < pageblock_order,
+ __entry->alloc_migratetype == __entry->fallback_migratetype)
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 440c439..4c20bfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
start_migratetype);
expand(zone, page, order, current_order, area, migratetype);
+
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, migratetype);
+
return page;
}
}
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
2009-08-07 17:40 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists as the latter requires the
zone lock to the taken and more data structures to be examined.
This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between being called
to refill the per-cpu lists or whether it is a high-order allocation.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.
This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/trace/events/kmem.h | 51 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 2 +
2 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index aae16ee..eaf46bd 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,57 @@ TRACE_EVENT(mm_page_alloc,
show_gfp_flags(__entry->gfp_flags))
);
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+ TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+ TP_ARGS(page, order, migratetype),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->order,
+ __entry->migratetype,
+ __entry->order == 0)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+ TP_PROTO(struct page *page, int order, int migratetype),
+
+ TP_ARGS(page, order, migratetype),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( int, order )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->order,
+ __entry->migratetype)
+);
+
TRACE_EVENT(mm_page_alloc_extfrag,
TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c20bfa..bbd7de8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_one_page list manipulates */
list_del(&page->lru);
+ trace_mm_page_pcpu_drain(page, order, page_private(page));
__free_one_page(page, zone, order, page_private(page));
}
spin_unlock(&zone->lock);
@@ -876,6 +877,7 @@ retry_reserve:
}
}
+ trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
}
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
` (2 preceding siblings ...)
2009-08-07 17:40 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
2009-08-07 19:10 ` Li, Ming Chun
2009-08-07 17:40 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
5 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like
Process Pages Pages Pages Pages PCPU PCPU PCPU Fragment Fragment MigType Fragment Fragment Unknown
details allocd allocd freed freed pages drains refills Fallback Causing Changed Severe Moderate
under lock direct pagevec drain
swapper-0 0 0 2 0 0 0 0 0 0 0 0 0 0
Xorg-3770 10603 5952 3685 6978 5996 194 192 0 0 0 0 0 0
modprobe-21397 51 0 0 86 31 1 0 0 0 0 0 0 0
xchat-5370 228 93 0 0 0 0 3 0 0 0 0 0 0
awesome-4317 32 32 0 0 0 0 32 0 0 0 0 0 0
thinkfan-3863 2 0 1 1 0 0 0 0 0 0 0 0 0
hald-addon-stor-3935 2 0 0 0 0 0 0 0 0 0 0 0 0
akregator-4506 1 1 0 0 0 0 1 0 0 0 0 0 0
xmms-14888 0 0 1 0 0 0 0 0 0 0 0 0 0
khelper-12 1 0 0 0 0 0 0 0 0 0 0 0 0
Optionally, the output can include information on the parent or aggregate
based on process name instead of aggregating based on each pid. Example output
including parent information and stripped out the PID looks something like;
Process Pages Pages Pages Pages PCPU PCPU PCPU Fragment Fragment MigType Fragment Fragment Unknown
details allocd allocd freed freed pages drains refills Fallback Causing Changed Severe Moderate
under lock direct pagevec drain
gdm-3756 :: Xorg-3770 3796 2976 99 3813 3224 104 98 0 0 0 0 0 0
init-1 :: hald-3892 1 0 0 0 0 0 0 0 0 0 0 0 0
git-21447 :: editor-21448 4 0 4 0 0 0 0 0 0 0 0 0 0
This says that Xorg allocated 3796 pages and it's parent process is gdm
with a PID of 3756;
The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously if the text format changes,
the parser will break but the regular expression parser is fairly rudimentary
so should be readily adjustable.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
.../postprocess/trace-pagealloc-postprocess.pl | 418 ++++++++++++++++++++
1 files changed, 418 insertions(+), 0 deletions(-)
create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..1a8a408
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+# --prepend-parent Report on the parent proc and PID
+# --read-procstat If the trace lacks process info, get it from /proc
+# --ignore-pid Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC => 1;
+use constant MM_PAGE_FREE_DIRECT => 2;
+use constant MM_PAGEVEC_FREE => 3;
+use constant MM_PAGE_PCPU_DRAIN => 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
+use constant MM_PAGE_ALLOC_EXTFRAG => 6;
+use constant EVENT_UNKNOWN => 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED => 8;
+use constant STATE_PCPU_PAGES_REFILLED => 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS => 10;
+use constant HIGH_PCPU_REFILLS => 11;
+use constant HIGH_EXT_FRAGMENT => 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+ my $current_time = time;
+ if ($current_time - 2 > $sigint_received) {
+ print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+ $sigint_report = 1;
+ } else {
+ if (!$sigint_exit) {
+ print "Second SIGINT received quickly, exiting\n";
+ }
+ $sigint_exit++;
+ }
+
+ if ($sigint_exit > 3) {
+ print "Many SIGINTs received, exiting now without report\n";
+ exit;
+ }
+
+ $sigint_received = $current_time;
+ $sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+ 'ignore-pid' => \$opt_ignorepid,
+ 'read-procstat' => \$opt_read_procstat,
+ 'prepend-parent' => \$opt_prepend_parent,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
+
+# Dyanically discovered regex
+my $regex_fragdetails;
+
+# Static regex used. Specified like this for readability and for use with /o
+# (process_pid) (cpus ) ( time ) (tpoint ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+ my $event = shift;
+ my $default = shift;
+ my @fields = @_;
+ my $regex;
+
+ # Read the event format or use the default
+ if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+ $regex = $default;
+ } else {
+ my $line;
+ while (!eof(FORMAT)) {
+ $line = <FORMAT>;
+ if ($line =~ /^print fmt:\s"(.*)",.*/) {
+ $regex = $1;
+ $regex =~ s/%p/\([0-9a-f]*\)/g;
+ $regex =~ s/%d/\([-0-9]*\)/g;
+ $regex =~ s/%lu/\([0-9]*\)/g;
+ }
+ }
+ }
+
+ # Verify fields are in the right order
+ my $tuple;
+ foreach $tuple (split /\s/, $regex) {
+ my ($key, $value) = split(/=/, $tuple);
+ my $expected = shift;
+ if ($key ne $expected) {
+ print("WARNING: Format not as expected '$key' != '$expected'");
+ $regex =~ s/$key=\((.*)\)/$key=$1/;
+ }
+ }
+ if (defined $_) {
+ die("Fewer fields than expected in format");
+ }
+
+ return $regex;
+}
+$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
+ $regex_fragdetails_default,
+ "page", "pfn",
+ "alloc_order", "fallback_order", "pageblock_order",
+ "alloc_migratetype", "fallback_migratetype",
+ "fragmenting", "change_ownership");
+
+sub read_statline($) {
+ my $pid = $_[0];
+ my $statline;
+
+ if (open(STAT, "/proc/$pid/stat")) {
+ $statline = <STAT>;
+ close(STAT);
+ }
+
+ if ($statline eq '') {
+ $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+ }
+
+ return $statline;
+}
+
+sub guess_process_pid($$) {
+ my $pid = $_[0];
+ my $statline = $_[1];
+
+ if ($pid == 0) {
+ return "swapper-0";
+ }
+
+ if ($statline !~ /$regex_statname/o) {
+ die("Failed to math stat line for process name :: $statline");
+ }
+ return "$1-$pid";
+}
+
+sub parent_info($$) {
+ my $pid = $_[0];
+ my $statline = $_[1];
+ my $ppid;
+
+ if ($pid == 0) {
+ return "NOPARENT-0";
+ }
+
+ if ($statline !~ /$regex_statppid/o) {
+ die("Failed to match stat line process ppid:: $statline");
+ }
+
+ # Read the ppid stat line
+ $ppid = $1;
+ return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+ my $traceevent;
+ my $process_pid;
+ my $cpus;
+ my $timestamp;
+ my $tracepoint;
+ my $details;
+ my $statline;
+
+ # Read each line of the event log
+EVENT_PROCESS:
+ while ($traceevent = <STDIN>) {
+ if ($traceevent =~ /$regex_traceevent/o) {
+ $process_pid = $1;
+ $tracepoint = $4;
+
+ if ($opt_read_procstat || $opt_prepend_parent) {
+ $process_pid =~ /(.*)-([0-9]*)$/;
+ my $process = $1;
+ my $pid = $2;
+
+ $statline = read_statline($pid);
+
+ if ($opt_read_procstat && $process eq '') {
+ $process_pid = guess_process_pid($pid, $statline);
+ }
+
+ if ($opt_prepend_parent) {
+ $process_pid = parent_info($pid, $statline) . " :: $process_pid";
+ }
+ }
+
+ # Unnecessary in this script. Uncomment if required
+ # $cpus = $2;
+ # $timestamp = $3;
+ } else {
+ next;
+ }
+
+ # Perl Switch() sucks majorly
+ if ($tracepoint eq "mm_page_alloc") {
+ $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+ } elsif ($tracepoint eq "mm_page_free_direct") {
+ $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+ } elsif ($tracepoint eq "mm_pagevec_free") {
+ $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+ } elsif ($tracepoint eq "mm_page_pcpu_drain") {
+ $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+ $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+ } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+ $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+ $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+ } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+ # Extract the details of the event now
+ $details = $5;
+
+ my ($page, $pfn);
+ my ($alloc_order, $fallback_order, $pageblock_order);
+ my ($alloc_migratetype, $fallback_migratetype);
+ my ($fragmenting, $change_ownership);
+
+ if ($details !~ /$regex_fragdetails/o) {
+ print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
+ next;
+ }
+
+ $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+ $page = $1;
+ $pfn = $2;
+ $alloc_order = $3;
+ $fallback_order = $4;
+ $pageblock_order = $5;
+ $alloc_migratetype = $6;
+ $fallback_migratetype = $7;
+ $fragmenting = $8;
+ $change_ownership = $9;
+
+ if ($fragmenting) {
+ $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+ if ($fallback_order <= 3) {
+ $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+ } else {
+ $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+ }
+ }
+ if ($change_ownership) {
+ $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+ }
+ } else {
+ $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+ }
+
+ # Catch a full pcpu drain event
+ if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+ $tracepoint ne "mm_page_pcpu_drain") {
+
+ $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+ $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+ }
+
+ # Catch a full pcpu refill event
+ if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+ $tracepoint ne "mm_page_alloc_zone_locked") {
+ $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+ $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+ }
+
+ if ($sigint_pending) {
+ last EVENT_PROCESS;
+ }
+ }
+}
+
+sub dump_stats {
+ my $hashref = shift;
+ my %stats = %$hashref;
+
+ # Dump per-process stats
+ my $process_pid;
+ my $max_strlen = 0;
+
+ # Get the maximum process name
+ foreach $process_pid (keys %perprocesspid) {
+ my $len = length($process_pid);
+ if ($len > $max_strlen) {
+ $max_strlen = $len;
+ }
+ }
+ $max_strlen += 2;
+
+ printf("\n");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
+ "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
+ "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
+
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
+ "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+ foreach $process_pid (keys %stats) {
+ # Dump final aggregates
+ if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+ $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+ $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+ }
+ if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+ $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+ $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+ }
+
+ printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
+ $process_pid,
+ $stats{$process_pid}->{MM_PAGE_ALLOC},
+ $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+ $stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+ $stats{$process_pid}->{MM_PAGEVEC_FREE},
+ $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+ $stats{$process_pid}->{HIGH_PCPU_DRAINS},
+ $stats{$process_pid}->{HIGH_PCPU_REFILLS},
+ $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+ $stats{$process_pid}->{HIGH_EXT_FRAG},
+ $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+ $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+ $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+ $stats{$process_pid}->{EVENT_UNKNOWN});
+ }
+}
+
+sub aggregate_perprocesspid() {
+ my $process_pid;
+ my $process;
+ undef %perprocess;
+
+ foreach $process_pid (keys %perprocesspid) {
+ $process = $process_pid;
+ $process =~ s/-([0-9])*$//;
+ if ($process eq '') {
+ $process = "NO_PROCESS_NAME";
+ }
+
+ $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+ $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+ $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+ $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+ $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+ $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+ $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+ $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+ $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+ $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+ $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+ $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+ $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+ }
+}
+
+sub report() {
+ if (!$opt_ignorepid) {
+ dump_stats(\%perprocesspid);
+ } else {
+ aggregate_perprocesspid();
+ dump_stats(\%perprocess);
+ }
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+ my $sigint_processed;
+ do {
+ $sigint_processed = 0;
+ process_events();
+
+ # Handle pending signals if any
+ if ($sigint_pending) {
+ my $current_time = time;
+
+ if ($sigint_exit) {
+ print "Received exit signal\n";
+ $sigint_pending = 0;
+ }
+ if ($sigint_report) {
+ if ($current_time >= $sigint_received + 2) {
+ report();
+ $sigint_report = 0;
+ $sigint_pending = 0;
+ $sigint_processed = 1;
+ }
+ }
+ }
+ } while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-08-07 19:10 ` Li, Ming Chun
2009-08-07 19:32 ` Li, Ming Chun
0 siblings, 1 reply; 11+ messages in thread
From: Li, Ming Chun @ 2009-08-07 19:10 UTC (permalink / raw)
To: Mel Gorman; +Cc: LKML, linux-mm
On Fri, 7 Aug 2009, Mel Gorman wrote:
> +sub generate_traceevent_regex {
> + my $event = shift;
> + my $default = shift;
> + my @fields = @_;
> + my $regex;
You are using shift to retrieve parameters below, @fields is not used
anywhere.
> +
> + # Read the event format or use the default
> + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> + $regex = $default;
> + } else {
> + my $line;
> + while (!eof(FORMAT)) {
> + $line = <FORMAT>;
> + if ($line =~ /^print fmt:\s"(.*)",.*/) {
> + $regex = $1;
> + $regex =~ s/%p/\([0-9a-f]*\)/g;
> + $regex =~ s/%d/\([-0-9]*\)/g;
> + $regex =~ s/%lu/\([0-9]*\)/g;
> + }
> + }
> + }
> +
> + # Verify fields are in the right order
> + my $tuple;
> + foreach $tuple (split /\s/, $regex) {
> + my ($key, $value) = split(/=/, $tuple);
> + my $expected = shift;
> + if ($key ne $expected) {
> + print("WARNING: Format not as expected '$key' != '$expected'");
> + $regex =~ s/$key=\((.*)\)/$key=$1/;
> + }
> + }
> + if (defined $_) {
> + die("Fewer fields than expected in format");
> + }
> +
How about:
if (defined shift) {
die("Fewer fields than expected in format");
}
?
I don't know, just ask if it is clear.
> + return $regex;
> +}
Vincent Li
Biomedical Research Center
University of British Columbia
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-08-07 19:10 ` Li, Ming Chun
@ 2009-08-07 19:32 ` Li, Ming Chun
2009-08-10 7:41 ` Mel Gorman
0 siblings, 1 reply; 11+ messages in thread
From: Li, Ming Chun @ 2009-08-07 19:32 UTC (permalink / raw)
To: Mel Gorman; +Cc: LKML, linux-mm
On Fri, 7 Aug 2009, Li, Ming Chun wrote:
> On Fri, 7 Aug 2009, Mel Gorman wrote:
>
> > +sub generate_traceevent_regex {
> > + my $event = shift;
> > + my $default = shift;
> > + my @fields = @_;
> > + my $regex;
>
> You are using shift to retrieve parameters below, @fields is not used
> anywhere.
>
> > +
> > + # Read the event format or use the default
> > + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> > + $regex = $default;
> > + } else {
> > + my $line;
> > + while (!eof(FORMAT)) {
> > + $line = <FORMAT>;
> > + if ($line =~ /^print fmt:\s"(.*)",.*/) {
> > + $regex = $1;
> > + $regex =~ s/%p/\([0-9a-f]*\)/g;
> > + $regex =~ s/%d/\([-0-9]*\)/g;
> > + $regex =~ s/%lu/\([0-9]*\)/g;
> > + }
> > + }
> > + }
> > +
> > + # Verify fields are in the right order
> > + my $tuple;
> > + foreach $tuple (split /\s/, $regex) {
> > + my ($key, $value) = split(/=/, $tuple);
> > + my $expected = shift;
> > + if ($key ne $expected) {
> > + print("WARNING: Format not as expected '$key' != '$expected'");
> > + $regex =~ s/$key=\((.*)\)/$key=$1/;
> > + }
> > + }
> > + if (defined $_) {
> > + die("Fewer fields than expected in format");
> > + }
> > +
>
> How about:
> if (defined shift) {
> die("Fewer fields than expected in format");
> }
> ?
>
> I don't know, just ask if it is clear.
Ah, I think it should be:
if (@_) {
die("Fewer fields than expected in format");
}
? Sorry for the noise :)
>
> > + return $regex;
> > +}
>
>
Vincent Li
Biomedical Research Center
University of British Columbia
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-08-07 19:32 ` Li, Ming Chun
@ 2009-08-10 7:41 ` Mel Gorman
0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-10 7:41 UTC (permalink / raw)
To: Li, Ming Chun; +Cc: LKML, linux-mm
On Fri, Aug 07, 2009 at 12:32:34PM -0700, Li, Ming Chun wrote:
> On Fri, 7 Aug 2009, Li, Ming Chun wrote:
>
> > On Fri, 7 Aug 2009, Mel Gorman wrote:
> >
> > > +sub generate_traceevent_regex {
> > > + my $event = shift;
> > > + my $default = shift;
> > > + my @fields = @_;
> > > + my $regex;
> >
> > You are using shift to retrieve parameters below, @fields is not used
> > anywhere.
> >
Correct, this can be removed. Initially, I was going to use an array but
shift was far neater. Forgot to cleanup afterwards.
> > > +
> > > + # Read the event format or use the default
> > > + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> > > + $regex = $default;
> > > + } else {
> > > + my $line;
> > > + while (!eof(FORMAT)) {
> > > + $line = <FORMAT>;
> > > + if ($line =~ /^print fmt:\s"(.*)",.*/) {
> > > + $regex = $1;
> > > + $regex =~ s/%p/\([0-9a-f]*\)/g;
> > > + $regex =~ s/%d/\([-0-9]*\)/g;
> > > + $regex =~ s/%lu/\([0-9]*\)/g;
> > > + }
> > > + }
> > > + }
> > > +
> > > + # Verify fields are in the right order
> > > + my $tuple;
> > > + foreach $tuple (split /\s/, $regex) {
> > > + my ($key, $value) = split(/=/, $tuple);
> > > + my $expected = shift;
> > > + if ($key ne $expected) {
> > > + print("WARNING: Format not as expected '$key' != '$expected'");
> > > + $regex =~ s/$key=\((.*)\)/$key=$1/;
> > > + }
> > > + }
> > > + if (defined $_) {
> > > + die("Fewer fields than expected in format");
> > > + }
> > > +
> >
> > How about:
> > if (defined shift) {
> > die("Fewer fields than expected in format");
> > }
> > ?
> >
> > I don't know, just ask if it is clear.
>
> Ah, I think it should be:
> if (@_) {
> die("Fewer fields than expected in format");
> }
>
> ? Sorry for the noise :)
>
It's not noise at all, you're right to point out something was wrong
here. It needed to be either
if (defined shift)
if (defined $_[0])
I went with your first suggestion of "if (defined shift)"
Thanks
==== CUT HERE ====
tracing, page-allocator: Fix sanity check of TP_printk format for mm_page_alloc_extfrag
The trace-pagealloc-postprocess.pl script sanity checks the TP_printk
format for mm_page_alloc_extfrag to ensure all expected fields are in
the output format. Ming Chun Li pointed out that the check for all
expected fields is checking the wrong scalar and that there was a
unused @fields declared. This patch deletes the unused variable and
fixes the check.
Reported-by: Ming Chun Li <macli@brc.ubc.ca>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
index 1a8a408..7df50e8 100755
--- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -91,7 +91,6 @@ my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
sub generate_traceevent_regex {
my $event = shift;
my $default = shift;
- my @fields = @_;
my $regex;
# Read the event format or use the default
@@ -120,7 +119,8 @@ sub generate_traceevent_regex {
$regex =~ s/$key=\((.*)\)/$key=$1/;
}
}
- if (defined $_) {
+
+ if (defined shift) {
die("Fewer fields than expected in format");
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
` (3 preceding siblings ...)
2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf PCL tools help files --help are there and
the code simple enough to figure out what much of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.
This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/trace/tracepoint-analysis.txt | 327 +++++++++++++++++++++++++++
1 files changed, 327 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/tracepoint-analysis.txt
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+ Notes on Analysing Behaviour Using Events and Tracepoints
+
+ Documentation written by Mel Gorman
+ PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, tracepoints will represent an important event that when can
+be taken in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practises,
+this document describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+ $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+ $ perf list 2>&1 | grep Tracepoint
+ ext4:ext4_free_inode [Tracepoint event]
+ ext4:ext4_request_inode [Tracepoint event]
+ ext4:ext4_allocate_inode [Tracepoint event]
+ ext4:ext4_write_begin [Tracepoint event]
+ ext4:ext4_ordered_write_end [Tracepoint event]
+ [ .... remaining output snipped .... ]
+
+
+2. Enabling Events
+==================
+
+2.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+ $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+
+2.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds what processes
+were allocating the pages.
+
+ global page_allocs
+
+ probe kernel.trace("mm_page_alloc") {
+ page_allocs[execname()]++
+ }
+
+ function print_count() {
+ printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+ foreach (proc in page_allocs-)
+ printf("%-25d %s\n", page_allocs[proc], proc)
+ printf ("\n")
+ delete page_allocs
+ }
+
+ probe timer.s(5) {
+ print_count()
+ }
+
+2.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and analysing sleep, the system-wide events
+for a duration of time can be examined.
+
+ $ perf stat -a \
+ -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+ -e kmem:mm_pagevec_free \
+ sleep 10
+ Performance counter stats for 'sleep 10':
+
+ 9630 kmem:mm_page_alloc
+ 2143 kmem:mm_page_free_direct
+ 7424 kmem:mm_pagevec_free
+
+ 10.002577764 seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+
+2.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
+
+2.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activate and tracked for the duration of a process on a local
+basis using PCL such as follows.
+
+ $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+ -e kmem:mm_pagevec_free ./hackbench 10
+ Time: 0.909
+
+ Performance counter stats for './hackbench 10':
+
+ 17803 kmem:mm_page_alloc
+ 12398 kmem:mm_page_free_direct
+ 4827 kmem:mm_pagevec_free
+
+ 0.973913387 seconds time elapsed
+
+3. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace. Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
+
+4. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation in. By and large, this is left to the
+performance analyst to do it by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+ $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
+ -e kmem:mm_pagevec_free ./hackbench 10
+ Time: 0.890
+ Time: 0.895
+ Time: 0.915
+ Time: 1.001
+ Time: 0.899
+
+ Performance counter stats for './hackbench 10' (5 runs):
+
+ 16630 kmem:mm_page_alloc ( +- 3.542% )
+ 11486 kmem:mm_page_free_direct ( +- 4.771% )
+ 4730 kmem:mm_pagevec_free ( +- 2.325% )
+
+ 0.982653002 seconds time elapsed ( +- 1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system wide basis using -a and sleep.
+
+ $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+ -e kmem:mm_pagevec_free \
+ -a --repeat 10 \
+ sleep 1
+ Performance counter stats for 'sleep 1' (10 runs):
+
+ 1066 kmem:mm_page_alloc ( +- 26.148% )
+ 182 kmem:mm_page_free_direct ( +- 5.464% )
+ 890 kmem:mm_pagevec_free ( +- 30.079% )
+
+ 1.002251757 seconds time elapsed ( +- 0.005% )
+
+5. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+ o Reading information from /proc for the PID that triggered the event
+ o Deriving a higher-level event from a series of lower-level events.
+ o Calculate latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without existing
+and twice to exit.
+
+Simplistically, the script just reads STDIN and counts up events but it
+also can do more such as
+
+ o Derive high-level events from many low-level events. If a number of pages
+ are freed to the main allocator from the per-CPU lists, it recognises
+ that as one per-CPU drain even though there is no specific tracepoint
+ for that event
+ o It can aggregate based on PID or individual process number
+ o In the event memory is getting externally fragmented, it reports
+ on whether the fragmentation event was severe or moderate.
+ o When receiving an event about a PID, it can record who the parent was so
+ that if large numbers of events are coming from very short-lived
+ processes, the parent process responsible for creating all the helpers
+ can be identified
+
+6. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions with a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+ $ perf record -c 1 \
+ -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+ -e kmem:mm_pagevec_free \
+ ./hackbench 10
+ Time: 0.894
+ [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the event period to sample. The default sample
+period is quite high to minimise overhead but the information collected can be
+very coarse as a result.
+
+This record outputted a file called perf.data which can be analysed using
+perf report.
+
+ $ perf report
+ # Samples: 30922
+ #
+ # Overhead Command Shared Object
+ # ........ ......... ................................
+ #
+ 87.27% hackbench [vdso]
+ 6.85% hackbench /lib/i686/cmov/libc-2.9.so
+ 2.62% hackbench /lib/ld-2.9.so
+ 1.52% perf [vdso]
+ 1.22% hackbench ./hackbench
+ 0.48% hackbench [kernel]
+ 0.02% perf /lib/i686/cmov/libc-2.9.so
+ 0.01% perf /usr/bin/perf
+ 0.01% perf /lib/ld-2.9.so
+ 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
+ #
+ # (For more details, try: perf report --sort comm,dso,symbol)
+ #
+
+According to this, the vast majority of events occured triggered on events
+within the VDSO. With simple binaries, this will often be the case so lets
+take a slightly different example. In the course of writing this, it was
+noticed that X was generating an insane amount of page allocations so lets look
+at it
+
+ $ perf record -c 1 -f \
+ -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+ -e kmem:mm_pagevec_free \
+ -p `pidof X`
+
+This was interrupted after a few seconds and
+
+ $ perf report
+ # Samples: 27666
+ #
+ # Overhead Command Shared Object
+ # ........ ....... .......................................
+ #
+ 51.95% Xorg [vdso]
+ 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
+ 0.09% Xorg /lib/i686/cmov/libc-2.9.so
+ 0.01% Xorg [kernel]
+ #
+ # (For more details, try: perf report --sort comm,dso,symbol)
+ #
+
+So, almost half of the events are occuring in a library. To get an idea which
+symbol.
+
+ $ perf report --sort comm,dso,symbol
+ # Samples: 27666
+ #
+ # Overhead Command Shared Object Symbol
+ # ........ ....... ....................................... ......
+ #
+ 51.95% Xorg [vdso] [.] 0x000000ffffe424
+ 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
+ 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
+ 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
+ 0.01% Xorg [kernel] [k] read_hpet
+ 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
+ 0.00% Xorg [kernel] [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+ $ perf annotate pixmanFillsse2
+ [ ... ]
+ 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
+ : }
+ :
+ : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+ : _mm_store_si128 (__m128i *__P, __m128i __B) : {
+ : *__P = __B;
+ 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
+ 0.00 : 34ef5: ff
+ 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
+ 0.00 : 34efd: ff
+ 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
+ 0.00 : 34f05: ff
+ 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
+ 0.00 : 34f0d: ff
+ 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
+ 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
+ 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
+ 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card. Further investigation would be needed to determine why pixmaps
+are being copied around so much but a starting point would be to take an
+ancient build of libpixmap out of the library path where it was totally
+forgotten about from months ago!
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
` (4 preceding siblings ...)
2009-08-07 17:40 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
To: Larry Woodman, Ingo Molnar, Andrew Morton
Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman
Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/trace/events-kmem.txt | 107 +++++++++++++++++++++++++++++++++++
1 files changed, 107 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/events-kmem.txt
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+ Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are four major subheadings.
+
+ o Slab allocation of small objects of unknown type (kmalloc)
+ o Slab allocation of small objects of known type
+ o Page allocation
+ o Per-CPU Allocator Activity
+ o External Fragmentation
+
+This document will describe what each of the tracepoints are and why they
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internal fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information
+
+3. Page allocation
+==================
+mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct page=%p pfn=%lu order=%d
+mm_pagevec_free page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free is
+triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each one
+which triggers a mm_page_pcpu_drain event.
+
+The individual nature of the events are so that pages can be tracked
+between allocation and freeing. A number of drain or refill pages that occur
+consecutively imply the zone->lock being taken once. Large amounts of PCP
+refills and drains could imply an imbalance between CPUs where too much work
+is being concentrated in one place. It could also indicate that the per-CPU
+lists should be a larger size. Finally, large amounts of refills on one CPU
+and drains on another could be a factor in causing large amounts of cache
+line bounces due to writes between CPUs and worth investigating if pages
+can be allocated and freed on the same CPU through some algorithm change.
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event implies that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurange of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
--
1.6.3.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread