From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Christoph Hellwig <hch@infradead.org>,
	Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan.kim@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Seth Jennings <sjenning@linux.vnet.ibm.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Metin Doslu <metin@citusdata.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [patch 9/9] mm: workingset: keep shadow entries in check
Date: Sat, 17 Aug 2013 15:31:23 -0400	[thread overview]
Message-ID: <1376767883-4411-10-git-send-email-hannes@cmpxchg.org> (raw)
In-Reply-To: <1376767883-4411-1-git-send-email-hannes@cmpxchg.org>

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
remain in use after a significant amount of their cache has been
reclaimed, without any of those pages refaulting: the shadow entries
just sit there and waste memory.  In the worst case, they accumulate
until the machine runs out of memory.

To get this under control, a list of inodes that contain shadow
entries is maintained.  If the global number of shadows exceeds a
certain threshold, a shrinker is activated that reclaims old entries
from the mappings.  This is heavy-handed, but it should not be a hot
path and is mainly there to protect against accidentally or
maliciously induced OOM kills.  The global list is also not a problem
because modifications to it are very rare: an inode is added once in
its lifetime, when its first shadow entry is stored (i.e. when the
first page is reclaimed), and lazily removed when the inode exits or
when the shrinker removes all of its shadow entries.
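
To illustrate the activation threshold, here is a standalone sketch
(not kernel code; the memory numbers are made up for the example, but
the ceiling of four times the dirtyable memory matches
get_nr_old_shadows() in the patch below):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical: 1G of dirtyable memory in 4k pages */
		unsigned long dirtyable_pages = (1UL << 30) / 4096;
		unsigned long nr_shadows = 2000000;	/* global shadow count */
		unsigned long ceiling = dirtyable_pages * 4;
		unsigned long excess = nr_shadows > ceiling ?
					nr_shadows - ceiling : 0;

		/* the shrinker only ever targets the excess */
		printf("ceiling=%lu excess=%lu\n", ceiling, excess);
		return 0;
	}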

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             |   5 +-
 include/linux/fs.h     |   1 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   4 +
 mm/filemap.c           |   4 +-
 mm/truncate.c          |   2 +-
 mm/vmstat.c            |   1 +
 mm/workingset.c        | 196 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 208 insertions(+), 6 deletions(-)
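
For review, a userspace model of the lazy list membership implemented
by workingset_shadows_inc/dec() below (an outline under simplifying
assumptions: single-threaded, with the real list and locking replaced
by a flag):

	#include <stdio.h>
	#include <stdbool.h>

	struct mapping {
		unsigned long nrshadows;
		bool on_list;	/* stands in for !list_empty(&shadow_list) */
	};

	static void shadows_inc(struct mapping *m)
	{
		/* link once, when the first shadow entry appears */
		if (m->nrshadows == 0 && !m->on_list)
			m->on_list = true;
		m->nrshadows++;
	}

	static void shadows_dec(struct mapping *m)
	{
		m->nrshadows--;
		/* deliberately stay linked; unlinking is left to the
		   shrinker or inode teardown */
	}

	int main(void)
	{
		struct mapping m = { 0, false };

		shadows_inc(&m);	/* first shadow: mapping gets linked */
		shadows_dec(&m);	/* count drops to zero: still linked */
		printf("nrshadows=%lu on_list=%d\n", m.nrshadows, m.on_list);
		return 0;
	}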

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	workingset_init_mapping(mapping);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
-	spin_lock_irq(&inode->i_data.tree_lock);
-	mapping_set_exiting(&inode->i_data);
-	spin_unlock_irq(&inode->i_data.tree_lock);
+	workingset_exit_mapping(&inode->i_data);
 
 	if (op->evict_inode) {
 		op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..140ae2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;	/* list of mappings with shadows */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 56f540e..747adae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -146,6 +146,7 @@ enum zone_stat_item {
 	WORKINGSET_STALE,
 	WORKINGSET_BALANCE,
 	WORKINGSET_BALANCE_FORCE,
+	WORKINGSET_SHADOWS_RECLAIMED,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..f18dd33eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,13 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..d47d83e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -466,7 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 5c85dd4..76064a4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4fabaed..b0b8e3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -743,6 +743,7 @@ const char * const vmstat_text[] = {
 	"workingset_stale",
 	"workingset_balance",
 	"workingset_balance_force",
+	"workingset_shadows_reclaimed",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 88ad7ea..00b2280e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,46 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_mappings);
+
+/**
+ * workingset_init_mapping - prepare address space for page reclaim
+ * @mapping: address space
+ *
+ * Must be called when the inode is instantiated, before any page
+ * cache is populated.
+ */
+void workingset_init_mapping(struct address_space *mapping)
+{
+	INIT_LIST_HEAD(&mapping->shadow_list);
+}
+
+/**
+ * workingset_exit_mapping - tell page reclaim address space is exiting
+ * @mapping: address space
+ *
+ * Must be called before the final truncate, to prevent page reclaim
+ * from installing shadow entries behind the back of the inode
+ * teardown process.
+ */
+void workingset_exit_mapping(struct address_space *mapping)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	mapping_set_exiting(mapping);
+	/*
+	 * Take the mapping off the shrinker list; the final truncate
+	 * is about to remove any remaining shadow entries.
+	 */
+	if (!list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
 static void *pack_shadow(unsigned long time, struct zone *zone)
 {
 	time = (time << NODES_SHIFT) | zone_to_nid(zone);
@@ -198,3 +238,159 @@ void workingset_activation(struct page *page)
 	if (zone->shrink_active > 0)
 		zone->shrink_active--;
 }
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+
+	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_mappings);
+		spin_unlock(&shadow_lock);
+	}
+
+	mapping->nrshadows++;
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+	/*
+	 * shadow_mappings operations are costly, so we keep the
+	 * mapping linked here even without any shadows left and
+	 * unlink it lazily in the shadow shrinker or when the inode
+	 * is destroyed.
+	 */
+}
+
+static unsigned long get_nr_old_shadows(void)
+{
+	unsigned long nr_max;
+	unsigned long nr;
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	nr = max(sum, 0L);
+
+	/*
+	 * Every shadow entry with a refault distance bigger than the
+	 * active list is ignored and so NR_ACTIVE_FILE would be a
+	 * reasonable ceiling.  But scanning and shrinking shadow
+	 * entries is quite expensive, so be generous.
+	 */
+	nr_max = global_dirtyable_memory() * 4;
+
+	if (nr <= nr_max)
+		return 0;
+	return nr - nr_max;
+}
+
+static unsigned long scan_mapping(struct address_space *mapping,
+				  unsigned long nr_to_scan)
+{
+	unsigned long nr_scanned = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long nrshadows;
+		unsigned long distance;
+		struct zone *zone;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		unpack_shadow(page, &zone, &distance);
+
+		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
+			workingset_shadows_dec(mapping);
+			nr_scanned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+
+	return nr_scanned;
+}
+
+static unsigned long count_shadows(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	return get_nr_old_shadows();
+}
+
+static unsigned long scan_shadows(struct shrinker *shrink,
+				  struct shrink_control *sc)
+{
+	unsigned long nr_to_scan = sc->nr_to_scan;
+
+	do {
+		struct address_space *mapping;
+
+		spin_lock_irq(&shadow_lock);
+		if (list_empty(&shadow_mappings)) {
+			spin_unlock_irq(&shadow_lock);
+			break;
+		}
+		mapping = list_entry(shadow_mappings.prev,
+				     struct address_space,
+				     shadow_list);
+		list_move(&mapping->shadow_list, &shadow_mappings);
+		__iget(mapping->host);
+		spin_unlock_irq(&shadow_lock);
+
+		if (mapping->nrshadows)
+			nr_to_scan -= scan_mapping(mapping, nr_to_scan);
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping->nrshadows == 0) {
+			spin_lock(&shadow_lock);
+			list_del_init(&mapping->shadow_list);
+			spin_unlock(&shadow_lock);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(mapping->host);
+
+	} while (nr_to_scan && get_nr_old_shadows());
+
+	return sc->nr_to_scan - nr_to_scan;
+}
+
+static struct shrinker shadow_shrinker = {
+	.count_objects = count_shadows,
+	.scan_objects = scan_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
+	register_shrinker(&shadow_shrinker);
+	return 0;
+}
+core_initcall(workingset_init);
-- 
1.8.3.2
