linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Christoph Hellwig <hch@infradead.org>,
	Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan.kim@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Seth Jennings <sjenning@linux.vnet.ibm.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Metin Doslu <metin@citusdata.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 9/9] mm: workingset: keep shadow entries in check
Date: Thu, 22 Aug 2013 11:48:53 +0200	[thread overview]
Message-ID: <20130822094853.GC26749@cmpxchg.org> (raw)
In-Reply-To: <20130820135924.937d93a3fd0368b48ba01189@linux-foundation.org>

On Tue, Aug 20, 2013 at 01:59:24PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> 
> erk.  This whole patch is overhead :(
> 
> > To get this under control, a list of inodes that contain shadow
> > entries is maintained.  If the global number of shadows exceeds a
> > certain threshold, a shrinker is activated that reclaims old entries
> > from the mappings.  This is heavy-handed but it should not be a hot
> > path and is mainly there to protect from accidentally/maliciously
> > induced OOM kills.  The global list is also not a problem because the
> > modifications are very rare: inodes are added once in their lifetime
> > when the first shadow entry is stored (i.e. the first page reclaimed)
> > and lazily removed when the inode exits.  Or if the shrinker removes
> > all shadow entries.
> > 
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -417,6 +417,7 @@ struct address_space {
> >  	/* Protected by tree_lock together with the radix tree */
> >  	unsigned long		nrpages;	/* number of total pages */
> >  	unsigned long		nrshadows;	/* number of shadow entries */
> > +	struct list_head	shadow_list;	/* list of mappings with shadows */
> >  	pgoff_t			writeback_index;/* writeback starts here */
> >  	const struct address_space_operations *a_ops;	/* methods */
> >  	unsigned long		flags;		/* error bits/gfp mask */
> 
> There's another 16 bytes into the inode.  Bad.

Yeah :(

An obvious alternative to storing page eviction information in the
page cache radix tree would be to have a separate data structure that
scales with the number of physical pages available.

It really depends on the workload which one's cheaper in terms of both
memory and cpu.

Workloads where the ratio between number of inodes and inode size is
high suffer from the increased inode size, but when they have decent
access locality within the inodes, the inodes should be evicted along
with their pages.  So in this case there is little to no memory
overhead from the eviction information compared to the fixed size
separate data structure.

And refault information lookup is cheaper of course when storing
eviction information inside the cache slots.

Workloads for which this model sucks are those with inode locality but
no data locality.  The inodes stay alive and the lack of data locality
produces many shadow entries that only the costly shrinker can get rid
of.  Numbers aside, it was a judgement call to improve workloads with
high data locality at the cost of those without.

But I'll include a more concrete cost analysis in the submission that
also includes more concrete details on the benefits of the series ;)

> > +void workingset_shadows_inc(struct address_space *mapping)
> > +{
> > +	might_lock(&shadow_lock);
> > +
> > +	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> > +		spin_lock(&shadow_lock);
> 
> I can't work out whether or not shadow_lock is supposed to be irq-save.
> Some places it is, others are unobvious.

It is.  The caller holds the irq-safe tree_lock, though, so no need to
disable IRQs a second time.  I'll add documentation.

> > +static unsigned long get_nr_old_shadows(void)
> > +{
> > +	unsigned long nr_max;
> > +	unsigned long nr;
> > +	long sum = 0;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		sum += per_cpu(nr_shadows, cpu);
> 
> Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
> calls it from a loop.  Can we use something non-deathly-slow here? 
> Like percpu_counter_read_positive()?

Finally, a usecase for percpu_counter!!  Sounds reasonable, I'll
convert this stuff over.

> > +	nr = max(sum, 0L);
> > +
> > +	/*
> > +	 * Every shadow entry with a refault distance bigger than the
> > +	 * active list is ignored and so NR_ACTIVE_FILE would be a
> > +	 * reasonable ceiling.  But scanning and shrinking shadow
> > +	 * entries is quite expensive, so be generous.
> > +	 */
> > +	nr_max = global_dirtyable_memory() * 4;
> > +
> > +	if (nr <= nr_max)
> > +		return 0;
> > +	return nr - nr_max;
> > +}
> > +
> > +static unsigned long scan_mapping(struct address_space *mapping,
> > +				  unsigned long nr_to_scan)
> 
> Some methodological description would be useful.

Fair enough, I'll write something.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-08-22  9:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-17 19:31 [patch 9/9] mm: thrash detection-based file cache sizing v4 Johannes Weiner
2013-08-17 19:31 ` [patch 1/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-17 19:31 ` [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-08-17 19:31 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-08-17 19:31 ` [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  7:20     ` Johannes Weiner
2013-08-17 19:31 ` [patch 5/9] mm + fs: store shadow entries in page cache Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-17 19:31 ` [patch 6/9] mm + fs: provide shadow pages to page cache allocations Johannes Weiner
2013-08-17 19:31 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
2013-08-17 19:31 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  8:45     ` Johannes Weiner
2013-08-17 19:31 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  9:48     ` Johannes Weiner [this message]
2013-08-20 21:04 ` [patch 9/9] mm: thrash detection-based file cache sizing v4 Andrew Morton
2013-08-22  9:08   ` Metin Doslu
  -- strict thread matches above, loose matches on Subject: below --
2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-11 23:56   ` Andi Kleen
2013-08-14 14:41     ` Johannes Weiner
2013-08-06 22:22 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:22 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130822094853.GC26749@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=gthelen@google.com \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=klamm@yandex-team.ru \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=metin@citusdata.com \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=ozgun@citusdata.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=sjenning@linux.vnet.ibm.com \
    --cc=vbabka@suse.cz \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox