[PATCH 0/6] Add some trace events for the page allocator v5

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Add some trace events for the page allocator v5
@ 2009-08-07 17:40 Mel Gorman
  2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This is V5 of a patchset. The main changes are due to feedback from Ingo on
optimising tracepoints to store the least amount of information in the record
during the trace fastpath and delay until the TP_printk step.  The regex
handling in the tracepoint postprocessing script is also a bit more robust.

Changelog since V4
  o Drop the order parameter from mm_pagevec_free() as it's implicitly 0
  o Drop the cpu= information from PCPU events as the CPU printed is
    incorrect and the information is already available
  o Pass down the minimum amount of information during fallback and the
    zone_locked event as the additional information in TP_printk
  o Pass down the minimum amount of information during fallback and figure
    out the additional information in the post-processing TP_printk
  o Make the post-processing script more robust against format changes.
    This could be significantly more robust and construct a regex on
    the fly but it makes more sense to leave this as a POC with the
    view to integrating properly with perf once the important information
    has been identified
  o Exit the script after multiple signals without waiting for further input

Changelog since V3
  o Drop call_site information from trace events
  o Use struct page * instead of void * in trace events
  o Add CPU information to the per-cpu tracepoints information
  o Improved performance of offline-process script so it can run online
  o Add support for interrupting processing script to dump what it has
  o Add support for stripping pids, getting additional information from
    proc and adding information on the parent process
  o Improve layout of output of post processing script for use with sort
  o Add documentation on performance analysis using tracepoints
  o Add documentation on the kmem tracepoints in particular

Changelog since V2
  o Added Ack-ed By's from Rik
  o Only call trace_mm_page_free_direct when page count reaches zero
  o Rebase to 2.6.31-rc5

Changelog since V1
  o Fix minor formatting error for the __rmqueue event
  o Add event for __pagevec_free
  o Bring naming more in line with Larry Woodman's tracing patch
  o Add an example post-processing script for the trace events

The following four patches add some trace events for the page allocator
under the heading of kmem.

	Patch 1 adds events for plain old allocate and freeing of pages
	Patch 2 gives information useful for analysing fragmentation avoidance
	Patch 3 tracks pages going to and from the buddy lists as an indirect
		indication of zone lock hotness
	Patch 4 adds a post-processing script that aggegates the events to
		give a higher-level view
	Patch 5 adds documentation on analysis using tracepoints
	Patch 6 adds documentation on the kmem tracepoints in particular

The first set of events can be used as an indicator as to whether the workload
was heavily dependant on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and seg-addr options
to get a proper stack trace.

The second patch is mainly of use to users of hugepages and particularly
dynamic hugepage pool resizing as it could be used to tune min_free_kbytes
to a level that fragmentation was rarely a problem. My main concern is
that maybe I'm trying to jam too much into the TP_printk that could be
extrapolated after the fact if you were familiar with the implementation. I
couldn't determine if it was best to hold the hand of the administrator
even if it cost more to figure it out.

The third patch is trickier to draw conclusions from but high activity on
those events could explain why there were a large number of cache misses
on a page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to enter
this path.

The fourth patch parses the trace buffer to draw a higher-level picture of
what is going on broken down on a per-process basis.

The last two patches add documentation.

 Documentation/trace/events-kmem.txt                |  107 +++++
 .../postprocess/trace-pagealloc-postprocess.pl     |  418 ++++++++++++++++++++
 Documentation/trace/tracepoint-analysis.txt        |  327 +++++++++++++++
 include/trace/events/kmem.h                        |  163 ++++++++
 mm/page_alloc.c                                    |   13 +-
 5 files changed, 1027 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  2009-08-08  5:47   ` KOSAKI Motohiro
  2009-08-07 17:40 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs.  Using the events, it will be known what
struct page and pfns are being allocated and freed and what the call site
was in many cases.

The page alloc tracepoints be used as an indicator as to whether the workload
was heavily dependant on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and seg-addr options
to get a proper stack trace.

The pagevec free tracepoint has a different usecase. It can be used to get
a idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   74 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    7 +++-
 2 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..0d358a0 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,80 @@ TRACE_EVENT(kmem_cache_free,
 
 	TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
 );
+
+TRACE_EVENT(mm_page_free_direct,
+
+	TP_PROTO(struct page *page, unsigned int order),
+
+	TP_ARGS(page, order),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+	TP_PROTO(struct page *page, int cold),
+
+	TP_ARGS(page, cold),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		cold		)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->cold		= cold;
+	),
+
+	TP_printk("page=%p pfn=%lu order=0 cold=%d",
+			__entry->page,
+			page_to_pfn(__entry->page),
+			__entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+	TP_PROTO(struct page *page, unsigned int order,
+			gfp_t gfp_flags, int migratetype),
+
+	TP_ARGS(page, order, gfp_flags, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	gfp_t,		gfp_flags	)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->gfp_flags	= gfp_flags;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d052abb..440c439 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 
 void free_hot_page(struct page *page)
 {
+	trace_mm_page_free_direct(page, 0);
 	free_hot_cold_page(page, 0);
 }
 	
@@ -1905,6 +1906,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 
+	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1945,13 +1947,16 @@ void __pagevec_free(struct pagevec *pvec)
 {
 	int i = pagevec_count(pvec);
 
-	while (--i >= 0)
+	while (--i >= 0) {
+		trace_mm_pagevec_free(pvec->pages[i], pvec->cold);
 		free_hot_cold_page(pvec->pages[i], pvec->cold);
+	}
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
+		trace_mm_page_free_direct(page, order);
 		if (order == 0)
 			free_hot_page(page);
 		else
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-08-08  5:47   ` KOSAKI Motohiro
  0 siblings, 0 replies; 11+ messages in thread
From: KOSAKI Motohiro @ 2009-08-08  5:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, Ingo Molnar, Andrew Morton, riel, Peter Zijlstra,
	LKML, linux-mm

2009/8/8 Mel Gorman <mel@csn.ul.ie>:
> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs.  Using the events, it will be known what
> struct page and pfns are being allocated and freed and what the call site
> was in many cases.
>
> The page alloc tracepoints be used as an indicator as to whether the workload
> was heavily dependant on the page allocator or not. You can make a guess based
> on vmstat but you can't get a per-process breakdown. Depending on the call
> path, the call_site for page allocation may be __get_free_pages() instead
> of a useful callsite. Instead of passing down a return address similar to
> slab debugging, the user should enable the stacktrace and seg-addr options
> to get a proper stack trace.
>
> The pagevec free tracepoint has a different usecase. It can be used to get
> a idea of how many pages are being dumped off the LRU and whether it is
> kswapd doing the work or a process doing direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>

Looks good to me.
  Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
  2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  2009-08-07 17:40 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances change the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely that fragmentation
will be a problem later for hugepage allocation at least but there are
other considerations such as the order of page being split to satisfy
the allocation.

This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    4 ++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0d358a0..aae16ee 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,44 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+	TP_PROTO(struct page *page,
+			int alloc_order, int fallback_order,
+			int alloc_migratetype, int fallback_migratetype),
+
+	TP_ARGS(page,
+		alloc_order, fallback_order,
+		alloc_migratetype, fallback_migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page			)
+		__field(	int,		alloc_order		)
+		__field(	int,		fallback_order		)
+		__field(	int,		alloc_migratetype	)
+		__field(	int,		fallback_migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page			= page;
+		__entry->alloc_order		= alloc_order;
+		__entry->fallback_order		= fallback_order;
+		__entry->alloc_migratetype	= alloc_migratetype;
+		__entry->fallback_migratetype	= fallback_migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->alloc_order,
+		__entry->fallback_order,
+		pageblock_order,
+		__entry->alloc_migratetype,
+		__entry->fallback_migratetype,
+		__entry->fallback_order < pageblock_order,
+		__entry->alloc_migratetype == __entry->fallback_migratetype)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 440c439..4c20bfa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							start_migratetype);
 
 			expand(zone, page, order, current_order, area, migratetype);
+
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+				start_migratetype, migratetype);
+
 			return page;
 		}
 	}
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
  2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
  2009-08-07 17:40 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists as the latter requires the
zone lock to the taken and more data structures to be examined.

This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between being called
to refill the per-cpu lists or whether it is a high-order allocation.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.

This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/trace/events/kmem.h |   51 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    2 +
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index aae16ee..eaf46bd 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -299,6 +299,57 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		__entry->order == 0)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+	TP_PROTO(struct page *page, int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	struct page *,	page		)
+		__field(	int,		order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->order,
+		__entry->migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c20bfa..bbd7de8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
+		trace_mm_page_pcpu_drain(page, order, page_private(page));
 		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
@@ -876,6 +877,7 @@ retry_reserve:
 		}
 	}
 
+	trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
 }
 
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
                   ` (2 preceding siblings ...)
  2009-08-07 17:40 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  2009-08-07 19:10   ` Li, Ming Chun
  2009-08-07 17:40 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
  2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
  5 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                under lock     direct  pagevec      drain
swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0

Optionally, the output can include information on the parent or aggregate
based on process name instead of aggregating based on each pid. Example output
including parent information and stripped out the PID looks something like;

Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                     under lock     direct  pagevec      drain
gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0

This says that Xorg allocated 3796 pages and it's parent process is gdm
with a PID of 3756;

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously if the text format changes,
the parser will break but the regular expression parser is fairly rudimentary
so should be readily adjustable.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  418 ++++++++++++++++++++
 1 files changed, 418 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..1a8a408
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent	Report on the parent proc and PID
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_PAGE_ALLOC		=> 1;
+use constant MM_PAGE_FREE_DIRECT 	=> 2;
+use constant MM_PAGEVEC_FREE		=> 3;
+use constant MM_PAGE_PCPU_DRAIN		=> 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED	=> 5;
+use constant MM_PAGE_ALLOC_EXTFRAG	=> 6;
+use constant EVENT_UNKNOWN		=> 7;
+
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED	=> 8;
+use constant STATE_PCPU_PAGES_REFILLED	=> 9;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS		=> 10;
+use constant HIGH_PCPU_REFILLS		=> 11;
+use constant HIGH_EXT_FRAGMENT		=> 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE	=> 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE	=> 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED	=> 15;
+
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+	'prepend-parent' =>	\$opt_prepend_parent,
+);
+
+# Defaults for dynamically discovered regex's
+my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
+
+# Dyanically discovered regex
+my $regex_fragdetails;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my @fields = @_;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		$regex = $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			if ($line =~ /^print fmt:\s"(.*)",.*/) {
+				$regex = $1;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected '$key' != '$expected'");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+	if (defined $_) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
+			$regex_fragdetails_default,
+			"page", "pfn",
+			"alloc_order", "fallback_order", "pageblock_order",
+			"alloc_migratetype", "fallback_migratetype",
+			"fragmenting", "change_ownership");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+sub parent_info($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+	my $ppid;
+
+	if ($pid == 0) {
+		return "NOPARENT-0";
+	}
+
+	if ($statline !~ /$regex_statppid/o) {
+		die("Failed to match stat line process ppid:: $statline");
+	}
+
+	# Read the ppid stat line
+	$ppid = $1;
+	return guess_process_pid($ppid, read_statline($ppid));
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$tracepoint = $4;
+
+			if ($opt_read_procstat || $opt_prepend_parent) {
+				$process_pid =~ /(.*)-([0-9]*)$/;
+				my $process = $1;
+				my $pid = $2;
+
+				$statline = read_statline($pid);
+
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+
+				if ($opt_prepend_parent) {
+					$process_pid = parent_info($pid, $statline) . " :: $process_pid";
+				}
+			}
+
+			# Unnecessary in this script. Uncomment if required
+			# $cpus = $2;
+			# $timestamp = $3;
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_page_alloc") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+		} elsif ($tracepoint eq "mm_page_free_direct") {
+			$perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+		} elsif ($tracepoint eq "mm_pagevec_free") {
+			$perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+		} elsif ($tracepoint eq "mm_page_pcpu_drain") {
+			$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+		} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+
+			# Extract the details of the event now
+			$details = $5;
+
+			my ($page, $pfn);
+			my ($alloc_order, $fallback_order, $pageblock_order);
+			my ($alloc_migratetype, $fallback_migratetype);
+			my ($fragmenting, $change_ownership);
+
+			if ($details !~ /$regex_fragdetails/o) {
+				print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
+				next;
+			}
+
+			$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+			$page = $1;
+			$pfn = $2;
+			$alloc_order = $3;
+			$fallback_order = $4;
+			$pageblock_order = $5;
+			$alloc_migratetype = $6;
+			$fallback_migratetype = $7;
+			$fragmenting = $8;
+			$change_ownership = $9;
+
+			if ($fragmenting) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+				if ($fallback_order <= 3) {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+				} else {
+					$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+				}
+			}
+			if ($change_ownership) {
+				$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		# Catch a full pcpu drain event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+				$tracepoint ne "mm_page_pcpu_drain") {
+
+			$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+
+		# Catch a full pcpu refill event
+		if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+				$tracepoint ne "mm_page_alloc_zone_locked") {
+			$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Get the maximum process name
+	foreach $process_pid (keys %perprocesspid) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+		"",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+
+	foreach $process_pid (keys %stats) {
+		# Dump final aggregates
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+		}
+		if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+			$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n",
+			$process_pid,
+			$stats{$process_pid}->{MM_PAGE_ALLOC},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+			$stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+			$stats{$process_pid}->{MM_PAGEVEC_FREE},
+			$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+			$stats{$process_pid}->{HIGH_PCPU_DRAINS},
+			$stats{$process_pid}->{HIGH_PCPU_REFILLS},
+			$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAG},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+			$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+			$stats{$process_pid}->{EVENT_UNKNOWN});
+	}
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+		$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+		$perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+		$perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+		$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+		$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+		$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+		$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+		$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+		$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-08-07 19:10   ` Li, Ming Chun
  2009-08-07 19:32     ` Li, Ming Chun
  0 siblings, 1 reply; 11+ messages in thread
From: Li, Ming Chun @ 2009-08-07 19:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, linux-mm

On Fri, 7 Aug 2009, Mel Gorman wrote:

> +sub generate_traceevent_regex {
> +	my $event = shift;
> +	my $default = shift;
> +	my @fields = @_;
> +	my $regex;

You are using shift to retrieve parameters below, @fields is not used 
anywhere.

> +
> +	# Read the event format or use the default
> +	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> +		$regex = $default;
> +	} else {
> +		my $line;
> +		while (!eof(FORMAT)) {
> +			$line = <FORMAT>;
> +			if ($line =~ /^print fmt:\s"(.*)",.*/) {
> +				$regex = $1;
> +				$regex =~ s/%p/\([0-9a-f]*\)/g;
> +				$regex =~ s/%d/\([-0-9]*\)/g;
> +				$regex =~ s/%lu/\([0-9]*\)/g;
> +			}
> +		}
> +	}
> +
> +	# Verify fields are in the right order
> +	my $tuple;
> +	foreach $tuple (split /\s/, $regex) {
> +		my ($key, $value) = split(/=/, $tuple);
> +		my $expected = shift;
> +		if ($key ne $expected) {
> +			print("WARNING: Format not as expected '$key' != '$expected'");
> +			$regex =~ s/$key=\((.*)\)/$key=$1/;
> +		}
> +	}
> +	if (defined $_) {
> +		die("Fewer fields than expected in format");
> +	}
> +

How about:
	if (defined shift) {
		die("Fewer fields than expected in format");
	}
? 

I don't know, just ask if it is clear.

> +	return $regex;
> +}


Vincent Li
Biomedical Research Center
University of British Columbia

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-07 19:10   ` Li, Ming Chun
@ 2009-08-07 19:32     ` Li, Ming Chun
  2009-08-10  7:41       ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Li, Ming Chun @ 2009-08-07 19:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, linux-mm

On Fri, 7 Aug 2009, Li, Ming Chun wrote:

> On Fri, 7 Aug 2009, Mel Gorman wrote:
> 
> > +sub generate_traceevent_regex {
> > +	my $event = shift;
> > +	my $default = shift;
> > +	my @fields = @_;
> > +	my $regex;
> 
> You are using shift to retrieve parameters below, @fields is not used 
> anywhere.
> 
> > +
> > +	# Read the event format or use the default
> > +	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> > +		$regex = $default;
> > +	} else {
> > +		my $line;
> > +		while (!eof(FORMAT)) {
> > +			$line = <FORMAT>;
> > +			if ($line =~ /^print fmt:\s"(.*)",.*/) {
> > +				$regex = $1;
> > +				$regex =~ s/%p/\([0-9a-f]*\)/g;
> > +				$regex =~ s/%d/\([-0-9]*\)/g;
> > +				$regex =~ s/%lu/\([0-9]*\)/g;
> > +			}
> > +		}
> > +	}
> > +
> > +	# Verify fields are in the right order
> > +	my $tuple;
> > +	foreach $tuple (split /\s/, $regex) {
> > +		my ($key, $value) = split(/=/, $tuple);
> > +		my $expected = shift;
> > +		if ($key ne $expected) {
> > +			print("WARNING: Format not as expected '$key' != '$expected'");
> > +			$regex =~ s/$key=\((.*)\)/$key=$1/;
> > +		}
> > +	}
> > +	if (defined $_) {
> > +		die("Fewer fields than expected in format");
> > +	}
> > +
> 
> How about:
> 	if (defined shift) {
> 		die("Fewer fields than expected in format");
> 	}
> ? 
> 
> I don't know, just ask if it is clear.

Ah, I think it should be:
	if (@_) {
		die("Fewer fields than expected in format");
	}

? Sorry for the noise :)

> 
> > +	return $regex;
> > +}
> 
> 

Vincent Li
Biomedical Research Center
University of British Columbia

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-07 19:32     ` Li, Ming Chun
@ 2009-08-10  7:41       ` Mel Gorman
  0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-10  7:41 UTC (permalink / raw)
  To: Li, Ming Chun; +Cc: LKML, linux-mm

On Fri, Aug 07, 2009 at 12:32:34PM -0700, Li, Ming Chun wrote:
> On Fri, 7 Aug 2009, Li, Ming Chun wrote:
> 
> > On Fri, 7 Aug 2009, Mel Gorman wrote:
> > 
> > > +sub generate_traceevent_regex {
> > > +	my $event = shift;
> > > +	my $default = shift;
> > > +	my @fields = @_;
> > > +	my $regex;
> > 
> > You are using shift to retrieve parameters below, @fields is not used 
> > anywhere.
> > 

Correct, this can be removed. Initially, I was going to use an array but
shift was far neater. Forgot to cleanup afterwards.

> > > +
> > > +	# Read the event format or use the default
> > > +	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
> > > +		$regex = $default;
> > > +	} else {
> > > +		my $line;
> > > +		while (!eof(FORMAT)) {
> > > +			$line = <FORMAT>;
> > > +			if ($line =~ /^print fmt:\s"(.*)",.*/) {
> > > +				$regex = $1;
> > > +				$regex =~ s/%p/\([0-9a-f]*\)/g;
> > > +				$regex =~ s/%d/\([-0-9]*\)/g;
> > > +				$regex =~ s/%lu/\([0-9]*\)/g;
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	# Verify fields are in the right order
> > > +	my $tuple;
> > > +	foreach $tuple (split /\s/, $regex) {
> > > +		my ($key, $value) = split(/=/, $tuple);
> > > +		my $expected = shift;
> > > +		if ($key ne $expected) {
> > > +			print("WARNING: Format not as expected '$key' != '$expected'");
> > > +			$regex =~ s/$key=\((.*)\)/$key=$1/;
> > > +		}
> > > +	}
> > > +	if (defined $_) {
> > > +		die("Fewer fields than expected in format");
> > > +	}
> > > +
> > 
> > How about:
> > 	if (defined shift) {
> > 		die("Fewer fields than expected in format");
> > 	}
> > ? 
> > 
> > I don't know, just ask if it is clear.
> 
> Ah, I think it should be:
> 	if (@_) {
> 		die("Fewer fields than expected in format");
> 	}
> 
> ? Sorry for the noise :)
> 

It's not noise at all, you're right to point out something was wrong
here. It needed to be either

if (defined shift)
if (defined $_[0])

I went with your first suggestion of "if (defined shift)"

Thanks

==== CUT HERE ====
tracing, page-allocator: Fix sanity check of TP_printk format for mm_page_alloc_extfrag

The trace-pagealloc-postprocess.pl script sanity checks the TP_printk
format for mm_page_alloc_extfrag to ensure all expected fields are in
the output format. Ming Chun Li pointed out that the check for all
expected fields is checking the wrong scalar and that there was a
unused @fields declared. This patch deletes the unused variable and
fixes the check.

Reported-by: Ming Chun Li <macli@brc.ubc.ca>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
index 1a8a408..7df50e8 100755
--- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -91,7 +91,6 @@ my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
 sub generate_traceevent_regex {
 	my $event = shift;
 	my $default = shift;
-	my @fields = @_;
 	my $regex;
 
 	# Read the event format or use the default
@@ -120,7 +119,8 @@ sub generate_traceevent_regex {
 			$regex =~ s/$key=\((.*)\)/$key=$1/;
 		}
 	}
-	if (defined $_) {
+
+	if (defined shift) {
 		die("Fewer fields than expected in format");
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
                   ` (3 preceding siblings ...)
  2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman
  5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

The documentation for ftrace, events and tracepoints is pretty
extensive. Similarly, the perf PCL tools help files --help are there and
the code simple enough to figure out what much of the switches mean.
However, pulling the discrete bits and pieces together and translating
that into "how do I solve a problem" requires a fair amount of
imagination.

This patch adds a simple document intended to get someone started on the
different ways of using tracepoints to gather meaningful data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/tracepoint-analysis.txt |  327 +++++++++++++++++++++++++++
 1 files changed, 327 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/tracepoint-analysis.txt

diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 0000000..e7a7d3e
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+		Notes on Analysing Behaviour Using Events and Tracepoints
+
+			Documentation written by Mel Gorman
+		PCL information heavily based on email from Ingo Molnar
+
+1. Introduction
+===============
+
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+
+Simplistically, tracepoints will represent an important event that when can
+be taken in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practises,
+this document describes some of the methods that can be used.
+
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+
+2. Listing Available Events
+===========================
+
+2.1 Standard Utilities
+----------------------
+
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+
+  $ find /sys/kernel/debug/tracing/events -type d
+
+will give a fair indication of the number of events available.
+
+2.2 PCL
+-------
+
+Discovery and enumeration of all counters and events, including tracepoints
+are available with the perf tool. Getting a list of available events is a
+simple case of
+
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+
+
+2. Enabling Events
+==================
+
+2.1 System-Wide Event Enabling
+------------------------------
+
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+
+2.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds what processes
+were allocating the pages.
+
+  global page_allocs
+
+  probe kernel.trace("mm_page_alloc") {
+  	page_allocs[execname()]++
+  }
+
+  function print_count() {
+  	printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+  	foreach (proc in page_allocs-)
+  		printf("%-25d %s\n", page_allocs[proc], proc)
+  	printf ("\n")
+  	delete page_allocs
+  }
+
+  probe timer.s(5) {
+          print_count()
+  }
+
+2.3 System-Wide Event Enabling with PCL
+---------------------------------------
+
+By specifying the -a switch and analysing sleep, the system-wide events
+for a duration of time can be examined.
+
+ $ perf stat -a \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	sleep 10
+ Performance counter stats for 'sleep 10':
+
+           9630  kmem:mm_page_alloc      
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free    
+
+   10.002577764  seconds time elapsed
+
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+
+2.4 Local Event Enabling
+------------------------
+
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
+
+2.5 Local Event Enablement with PCL
+-----------------------------------
+
+Events can be activate and tracked for the duration of a process on a local
+basis using PCL such as follows.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+
+    Performance counter stats for './hackbench 10':
+
+          17803  kmem:mm_page_alloc      
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free    
+
+    0.973913387  seconds time elapsed
+
+3. Event Filtering
+==================
+
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace.  Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
+
+4. Analysing Event Variances with PCL
+=====================================
+
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation in. By and large, this is left to the
+performance analyst to do it by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
+			-e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+
+   Performance counter stats for './hackbench 10' (5 runs):
+
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system wide basis using -a and sleep.
+
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-a --repeat 10 \
+		sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+
+5. Higher-Level Analysis with Helper Scripts
+============================================
+
+When events are enabled the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events.
+  o Calculate latencies between two events
+
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without existing
+and twice to exit.
+
+Simplistically, the script just reads STDIN and counts up events but it
+also can do more such as
+
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on PID or individual process number
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
+
+6. Lower-Level Analysis with PCL
+================================
+
+There may also be a requirement to identify what functions with a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+
+  $ perf record -c 1 \
+	-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+	-e kmem:mm_pagevec_free \
+	./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+
+Note the use of '-c 1' to set the event period to sample. The default sample
+period is quite high to minimise overhead but the information collected can be
+very coarse as a result.
+
+This record outputted a file called perf.data which can be analysed using
+perf report.
+
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]                          
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so      
+       2.62%  hackbench  /lib/ld-2.9.so                  
+       1.52%       perf  [vdso]                          
+       1.22%  hackbench  ./hackbench                     
+       0.48%  hackbench  [kernel]                        
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so      
+       0.01%       perf  /usr/bin/perf                   
+       0.01%       perf  /lib/ld-2.9.so                  
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+According to this, the vast majority of events occured triggered on events
+within the VDSO. With simple binaries, this will often be the case so lets
+take a slightly different example. In the course of writing this, it was
+noticed that X was generating an insane amount of page allocations so lets look
+at it
+
+  $ perf record -c 1 -f \
+		-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+		-e kmem:mm_pagevec_free \
+		-p `pidof X`
+
+This was interrupted after a few seconds and
+
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]                                 
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so             
+       0.01%     Xorg  [kernel]                               
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+
+So, almost half of the events are occuring in a library. To get an idea which
+symbol.
+
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+
+To see where within the function pixmanFillsse2 things are going wrong
+
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff 
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff 
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff 
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff 
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much but a starting point would be to take an
+ancient build of libpixmap out of the library path where it was totally
+forgotten about from months ago!
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints
  2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
                   ` (4 preceding siblings ...)
  2009-08-07 17:40 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
@ 2009-08-07 17:40 ` Mel Gorman
  5 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2009-08-07 17:40 UTC (permalink / raw)
  To: Larry Woodman, Ingo Molnar, Andrew Morton
  Cc: riel, Peter Zijlstra, LKML, linux-mm, Mel Gorman

Knowing tracepoints exist is not quite the same as knowing what they
should be used for. This patch adds a document giving a basic
description of the kmem tracepoints and why they might be useful to a
performance analyst.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/trace/events-kmem.txt |  107 +++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/events-kmem.txt

diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 0000000..6ef2a86
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+			Subsystem Trace Points: kmem
+
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are four major subheadings.
+
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+
+This document will describe what each of the tracepoints are and why they
+might be useful.
+
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc		call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree		call_site=%lx ptr=%p
+
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internal fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
+
+
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node	call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free		call_site=%lx ptr=%p
+
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information
+
+3. Page allocation
+==================
+mm_page_alloc		  page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct	  page=%p pfn=%lu order=%d
+mm_pagevec_free		  page=%p pfn=%lu order=%d cold=%d
+
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+
+When pages are freed using a pagevec, the mm_pagevec_free is
+triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+
+When the per-CPU list is too full, a number of pages are freed, each one
+which triggers a mm_page_pcpu_drain event.
+
+The individual nature of the events are so that pages can be tracked
+between allocation and freeing. A number of drain or refill pages that occur
+consecutively imply the zone->lock being taken once. Large amounts of PCP
+refills and drains could imply an imbalance between CPUs where too much work
+is being concentrated in one place. It could also indicate that the per-CPU
+lists should be a larger size. Finally, large amounts of refills on one CPU
+and drains on another could be a factor in causing large amounts of cache
+line bounces due to writes between CPUs and worth investigating if pages
+can be allocated and freed on the same CPU through some algorithm change.
+
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag		page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+
+Large numbers of this event implies that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurange of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-08-10  7:41 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-07 17:40 [PATCH 0/6] Add some trace events for the page allocator v5 Mel Gorman
2009-08-07 17:40 ` [PATCH 1/6] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
2009-08-08  5:47   ` KOSAKI Motohiro
2009-08-07 17:40 ` [PATCH 2/6] tracing, page-allocator: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
2009-08-07 17:40 ` [PATCH 3/6] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
2009-08-07 17:40 ` [PATCH 4/6] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
2009-08-07 19:10   ` Li, Ming Chun
2009-08-07 19:32     ` Li, Ming Chun
2009-08-10  7:41       ` Mel Gorman
2009-08-07 17:40 ` [PATCH 5/6] tracing, documentation: Add a document describing how to do some performance analysis with tracepoints Mel Gorman
2009-08-07 17:40 ` [PATCH 6/6] tracing, documentation: Add a document on the kmem tracepoints Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox