linux-mm.kvack.org archive mirror
From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-kernel@vger.kernel.org,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	linux-mm@kvack.org, Andi Kleen <ak@suse.de>
Subject: Re: [RFC] Event counters [1/3]: Basic counter functionality
Date: Sat, 31 Dec 2005 04:46:15 -0200	[thread overview]
Message-ID: <20051231064615.GB11069@dmt.cnet> (raw)
In-Reply-To: <20051220235733.30925.55642.sendpatchset@schroedinger.engr.sgi.com>

Hi Christoph,

On Tue, Dec 20, 2005 at 03:57:33PM -0800, Christoph Lameter wrote:
> Light weight counter functions
> 
> The remaining counters in page_state after the zoned VM counter patch has been
> applied are all just for show in /proc/vmstat. They have no essential function
> for the VM, so we can make these counters lightweight by ignoring races, and
> also provide an off switch for embedded systems that allows building a Linux
> kernel without these counters.
> 
> The implementation of these counters is through inline code that typically
> results in a simple increment of a global memory location.
> 
> Also
> - Rename page_state to event_state
> - Make event state an array indexed by the event item.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.15-rc5-mm3/init/Kconfig
> ===================================================================
> --- linux-2.6.15-rc5-mm3.orig/init/Kconfig	2005-12-16 11:44:09.000000000 -0800
> +++ linux-2.6.15-rc5-mm3/init/Kconfig	2005-12-20 14:15:23.000000000 -0800
> @@ -411,6 +411,15 @@ config SLAB
>  	  SLOB is more space efficient but does not scale well and is
>  	  more susceptible to fragmentation.
>  
> +config EVENT_COUNTERS

Please rename to VM_EVENT_COUNTERS or a better name.

> +	default y
> +	bool "Enable event counters for /proc/vmstat" if EMBEDDED
> +	help
> +	  Event counters are only needed to display statistics. They
> +	  have no function for the kernel itself. This option allows
> +	  the disabling of the event counters. /proc/vmstat will only
> +	  contain essential counters.
> +
>  config SERIAL_PCI
>  	depends PCI && SERIAL_8250
>  	default y
> Index: linux-2.6.15-rc5-mm3/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.15-rc5-mm3.orig/include/linux/page-flags.h	2005-12-20 13:15:45.000000000 -0800
> +++ linux-2.6.15-rc5-mm3/include/linux/page-flags.h	2005-12-20 14:55:00.000000000 -0800
> @@ -77,120 +77,70 @@
>  #define PG_nosave_free		18	/* Free, should not be written */
>  #define PG_uncached		19	/* Page has been mapped as uncached */
>  
> +#ifdef CONFIG_EVENT_COUNTERS
>  /*
> - * Global page accounting.  One instance per CPU.  Only unsigned longs are
> - * allowed.
> + * Light weight per cpu counter implementation.
>   *
> - * - Fields can be modified with xxx_page_state and xxx_page_state_zone at
> - * any time safely (which protects the instance from modification by
> - * interrupt.
> - * - The __xxx_page_state variants can be used safely when interrupts are
> - * disabled.
> - * - The __xxx_page_state variants can be used if the field is only
> - * modified from process context, or only modified from interrupt context.
> - * In this case, the field should be commented here.
> + * Note that these can race. We do not bother to enable preemption
> + * or care about interrupt races. All we care about is to have some
> + * approximate count of events.

What about this addition to the documentation above, to make it a little more 
verbose:

	The possible race scenario is restricted to kernel preemption,
	and could happen as follows:

	thread A				thread B
a)	movl    xyz(%ebp), %eax			movl    xyz(%ebp), %eax
b)	incl    %eax				incl    %eax
c)	movl    %eax, xyz(%ebp)			movl    %eax, xyz(%ebp)

Thread A can be preempted at b), and thread B successfully increments the
counter, writing it back to memory. Now thread A resumes execution with
its stale copy of the counter and overwrites the current value, so the
increment performed by thread B is lost.

However, that should be a relatively rare condition.

> + *
> + * Counters should only be incremented and no critical kernel component
> + * should rely on the counter values.
> + *
> + * Counters are handled completely inline. On many platforms the code
> + * generated will simply be the increment of a global address.
>   */
> -struct page_state {
> -	/*
> -	 * The below are zeroed by get_page_state().  Use get_full_page_state()
> -	 * to add up all these.
> -	 */
> -	unsigned long pgpgin;		/* Disk reads */
> -	unsigned long pgpgout;		/* Disk writes */
> -	unsigned long pswpin;		/* swap reads */
> -	unsigned long pswpout;		/* swap writes */
> -
> -	unsigned long pgalloc_high;	/* page allocations */
> -	unsigned long pgalloc_normal;
> -	unsigned long pgalloc_dma32;
> -	unsigned long pgalloc_dma;
> -
> -	unsigned long pgfree;		/* page freeings */
> -	unsigned long pgactivate;	/* pages moved inactive->active */
> -	unsigned long pgdeactivate;	/* pages moved active->inactive */
> -
> -	unsigned long pgfault;		/* faults (major+minor) */
> -	unsigned long pgmajfault;	/* faults (major only) */
> -
> -	unsigned long pgrefill_high;	/* inspected in refill_inactive_zone */
> -	unsigned long pgrefill_normal;
> -	unsigned long pgrefill_dma32;
> -	unsigned long pgrefill_dma;
> -
> -	unsigned long pgsteal_high;	/* total highmem pages reclaimed */
> -	unsigned long pgsteal_normal;
> -	unsigned long pgsteal_dma32;
> -	unsigned long pgsteal_dma;
> -
> -	unsigned long pgscan_kswapd_high;/* total highmem pages scanned */
> -	unsigned long pgscan_kswapd_normal;
> -	unsigned long pgscan_kswapd_dma32;
> -	unsigned long pgscan_kswapd_dma;
> -
> -	unsigned long pgscan_direct_high;/* total highmem pages scanned */
> -	unsigned long pgscan_direct_normal;
> -	unsigned long pgscan_direct_dma32;
> -	unsigned long pgscan_direct_dma;
> -
> -	unsigned long pginodesteal;	/* pages reclaimed via inode freeing */
> -	unsigned long slabs_scanned;	/* slab objects scanned */
> -	unsigned long kswapd_steal;	/* pages reclaimed by kswapd */
> -	unsigned long kswapd_inodesteal;/* reclaimed via kswapd inode freeing */
> -	unsigned long pageoutrun;	/* kswapd's calls to page reclaim */
> -	unsigned long allocstall;	/* direct reclaim calls */
> +#define FOR_ALL_ZONES(x) x##_DMA, x##_DMA32, x##_NORMAL, x##_HIGH
>  
> -	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
> +enum event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> +		FOR_ALL_ZONES(PGALLOC),
> +		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFAULT, PGMAJFAULT,
> + 		FOR_ALL_ZONES(PGREFILL),
> + 		FOR_ALL_ZONES(PGSTEAL),
> +		FOR_ALL_ZONES(PGSCAN_KSWAPD),
> +		FOR_ALL_ZONES(PGSCAN_DIRECT),
> +		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		NR_EVENT_ITEMS
>  };
>  
> -extern void get_full_page_state(struct page_state *ret);
> -extern unsigned long read_page_state_offset(unsigned long offset);
> -extern void mod_page_state_offset(unsigned long offset, unsigned long delta);
> -extern void __mod_page_state_offset(unsigned long offset, unsigned long delta);
> -
> -#define read_page_state(member) \
> -	read_page_state_offset(offsetof(struct page_state, member))
> -
> -#define mod_page_state(member, delta)	\
> -	mod_page_state_offset(offsetof(struct page_state, member), (delta))
> -
> -#define __mod_page_state(member, delta)	\
> -	__mod_page_state_offset(offsetof(struct page_state, member), (delta))
> -
> -#define inc_page_state(member)		mod_page_state(member, 1UL)
> -#define dec_page_state(member)		mod_page_state(member, 0UL - 1)
> -#define add_page_state(member,delta)	mod_page_state(member, (delta))
> -#define sub_page_state(member,delta)	mod_page_state(member, 0UL - (delta))
> -
> -#define __inc_page_state(member)	__mod_page_state(member, 1UL)
> -#define __dec_page_state(member)	__mod_page_state(member, 0UL - 1)
> -#define __add_page_state(member,delta)	__mod_page_state(member, (delta))
> -#define __sub_page_state(member,delta)	__mod_page_state(member, 0UL - (delta))
> -
> -#define page_state(member) (*__page_state(offsetof(struct page_state, member)))
> -
> -#define state_zone_offset(zone, member)					\
> -({									\
> -	unsigned offset;						\
> -	if (is_highmem(zone))						\
> -		offset = offsetof(struct page_state, member##_high);	\
> -	else if (is_normal(zone))					\
> -		offset = offsetof(struct page_state, member##_normal);	\
> -	else if (is_dma32(zone))					\
> -		offset = offsetof(struct page_state, member##_dma32);	\
> -	else								\
> -		offset = offsetof(struct page_state, member##_dma);	\
> -	offset;								\
> -})
> -
> -#define __mod_page_state_zone(zone, member, delta)			\
> - do {									\
> -	__mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
> - } while (0)
> -
> -#define mod_page_state_zone(zone, member, delta)			\
> - do {									\
> -	mod_page_state_offset(state_zone_offset(zone, member), (delta)); \
> - } while (0)
> +struct event_state {
> +	unsigned long event[NR_EVENT_ITEMS];
> +};
> +
> +DECLARE_PER_CPU(struct event_state, event_states);

Might be interesting to mark this structure as __cacheline_aligned_in_smp to
avoid unrelated data sitting in the same cache line?

Thread overview: 14+ messages
2005-12-20 23:57 Christoph Lameter
2005-12-20 23:57 ` [RFC] Event counters [2/3]: Convert inc_page_state -> count_event Christoph Lameter
2005-12-20 23:57 ` [RFC] Event counters [3/3]: Convert NUMA counters to event counters Christoph Lameter
2005-12-21 22:24   ` Christoph Lameter
2005-12-31  6:46 ` Marcelo Tosatti [this message]
2005-12-31  7:54   ` [RFC] Event counters [1/3]: Basic counter functionality Nick Piggin
2005-12-31 20:26     ` Marcelo Tosatti
2006-01-02 21:40       ` Marcelo Tosatti
2006-01-03  5:11         ` Nick Piggin
2006-01-03 10:11           ` Marcelo Tosatti
2006-01-03 13:21             ` Nick Piggin
2006-01-03 12:06               ` Marcelo Tosatti
2006-01-03  5:01       ` Nick Piggin
2006-01-03 19:08     ` Christoph Lameter
