linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: fix page cache convergence regression
Date: Thu, 30 May 2019 12:15:48 -0400	[thread overview]
Message-ID: <20190530161548.GA8415@cmpxchg.org> (raw)
In-Reply-To: <20190524173900.GA11702@cmpxchg.org>

Are there any objections or feedback on the proposed fix below? This
is kind of a serious regression.

On Fri, May 24, 2019 at 01:39:00PM -0400, Johannes Weiner wrote:
> From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 24 May 2019 10:12:46 -0400
> Subject: [PATCH] mm: fix page cache convergence regression
> 
> Since a28334862993 ("page cache: Finish XArray conversion"), on most
> major Linux distributions, the page cache doesn't correctly transition
> when the hot data set is changing, and leaves the new pages thrashing
> indefinitely instead of kicking out the cold ones.
> 
> On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
> running stock Arch Linux:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120086/153600 workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 104029/153600 workingset-a
> 120268/153600 workingset-b
> 
> workingset-b is a 600M file on a 1G host that is otherwise entirely
> idle. No matter how often it's being accessed, it won't get cached.
> 
> While investigating, I noticed that the non-resident information gets
> aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
> a problem because a workingset transition like this relies on the
> non-resident information tracked in the page cache tree of evicted
> file ranges: when the cache faults are refaults of recently evicted
> cache, we challenge the existing active set, and that allows a new
> workingset to establish itself.
> 
> Tracing the shrinker that maintains this memory revealed that all page
> cache tree nodes were allocated to the root cgroup. This is a problem,
> because 1) the shrinker sizes the amount of non-resident information
> it keeps to the size of the cgroup's other memory and 2) on most major
> Linux distributions, only kernel threads live in the root cgroup and
> everything else gets put into services or session groups:
> 
> [root@ham ~]# cat /proc/self/cgroup
> 0::/user.slice/user-0.slice/session-c1.scope
> 
> As a result, we basically maintain no non-resident information for the
> workloads running on the system, thus breaking the caching algorithm.
> 
> Looking through the code, I found the culprit in the above-mentioned
> patch: when switching from the radix tree to xarray, it dropped the
> __GFP_ACCOUNT flag from the tree node allocations - the flag that
> makes sure the allocated memory gets charged to and tracked by the
> cgroup of the calling process - in this case, the one doing the fault.
> 
> To fix this, allow xarray users to specify per-tree flag that makes
> xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
> tree annotation to request such cgroup tracking for the cache nodes.
> 
> With this patch applied, the page cache correctly converges on new
> workingsets again after just a few iterations:
> 
> [root@ham ~]# ./reclaimtest.sh
> + dd of=workingset-a bs=1M count=0 seek=600
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + cat workingset-a
> + ./mincore workingset-a
> 153600/153600 workingset-a
> + dd of=workingset-b bs=1M count=0 seek=600
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 124607/153600 workingset-a
> 87876/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 81313/153600 workingset-a
> 133321/153600 workingset-b
> + cat workingset-b
> + ./mincore workingset-a workingset-b
> 63036/153600 workingset-a
> 153600/153600 workingset-b
> 
> Cc: stable@vger.kernel.org # 4.20+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/inode.c             |  2 +-
>  include/linux/xarray.h |  1 +
>  lib/xarray.c           | 12 ++++++++++--
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e9d18b2c3f91..cd67859dbaf1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -361,7 +361,7 @@ EXPORT_SYMBOL(inc_nlink);
>  
>  static void __address_space_init_once(struct address_space *mapping)
>  {
> -	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
> +	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>  	init_rwsem(&mapping->i_mmap_rwsem);
>  	INIT_LIST_HEAD(&mapping->private_list);
>  	spin_lock_init(&mapping->private_lock);
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 0e01e6129145..5921599b6dc4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -265,6 +265,7 @@ enum xa_lock_type {
>  #define XA_FLAGS_TRACK_FREE	((__force gfp_t)4U)
>  #define XA_FLAGS_ZERO_BUSY	((__force gfp_t)8U)
>  #define XA_FLAGS_ALLOC_WRAPPED	((__force gfp_t)16U)
> +#define XA_FLAGS_ACCOUNT	((__force gfp_t)32U)
>  #define XA_FLAGS_MARK(mark)	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
>  						(__force unsigned)(mark)))
>  
> diff --git a/lib/xarray.c b/lib/xarray.c
> index 6be3acbb861f..446b956c9188 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  	if (!xas->xa_alloc)
>  		return false;
> @@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
>  		xas_destroy(xas);
>  		return false;
>  	}
> +	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +		gfp |= __GFP_ACCOUNT;
>  	if (gfpflags_allow_blocking(gfp)) {
>  		xas_unlock_type(xas, lock_type);
>  		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> @@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
>  	if (node) {
>  		xas->xa_alloc = NULL;
>  	} else {
> -		node = kmem_cache_alloc(radix_tree_node_cachep,
> -					GFP_NOWAIT | __GFP_NOWARN);
> +		gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
> +
> +		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
> +			gfp |= __GFP_ACCOUNT;
> +
> +		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
>  		if (!node) {
>  			xas_set_err(xas, -ENOMEM);
>  			return NULL;
> -- 
> 2.21.0
> 


  parent reply	other threads:[~2019-05-30 16:15 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-24 15:31 Johannes Weiner
2019-05-24 16:04 ` Matthew Wilcox
2019-05-24 17:39   ` Johannes Weiner
2019-05-24 18:04     ` Shakeel Butt
2019-05-30 16:15     ` Johannes Weiner [this message]
2019-05-30 17:13       ` Matthew Wilcox
2019-05-30 17:58         ` Johannes Weiner
2019-06-24 15:19         ` Johannes Weiner
2019-06-24 21:52           ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190530161548.GA8415@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox