From: Vladimir Davydov <vdavydov@parallels.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Michal Hocko <mhocko@kernel.org>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] mm/slab: skip memcg reclaim only if in atomic context
Date: Sun, 30 Aug 2015 22:02:17 +0300	[thread overview]
Message-ID: <b34f641e1ba70aeb8754188a2dfa6c2d641e4cdc.1440960578.git.vdavydov@parallels.com> (raw)
In-Reply-To: <cover.1440960578.git.vdavydov@parallels.com>

SLAB's implementation of kmem_cache_alloc() works as follows:
 1. First, it tries to allocate from the preferred NUMA node without
    issuing reclaim.
 2. If step 1 fails, it tries all nodes in the order of preference,
    again without invoking the reclaimer.
 3. Only if steps 1 and 2 fail does it fall back on allocation from any
    allowed node with reclaim enabled (see the sketch below).
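
Schematically (a simplified sketch of the allocation order; the helper
names below are made up for illustration, the real code lives in
____cache_alloc_node() and fallback_alloc() in mm/slab.c):

	/* step 1: preferred node only, no reclaim */
	obj = alloc_on_node(cachep, flags_no_reclaim, preferred_node);

	/* step 2: all allowed nodes in preference order, no reclaim */
	for_each_node_in_preference_order(nid) {
		if (obj)
			break;
		obj = alloc_on_node(cachep, flags_no_reclaim, nid);
	}

	/* step 3: any allowed node, reclaim enabled */
	if (!obj)
		obj = grow_cache_anywhere(cachep, flags);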

Before commit 4167e9b2cf10f ("mm: remove GFP_THISNODE"), the
GFP_THISNODE combination, which equaled
__GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY on NUMA-enabled builds, was
used to avoid reclaim during steps 1 and 2. If __alloc_pages_slowpath()
saw this combination in the gfp flags, it aborted immediately even if
the __GFP_WAIT flag was set. So there was no need to clear the
__GFP_WAIT flag while performing steps 1 and 2, and hence we could
invoke memcg reclaim when allocating a slab page if the context
allowed.
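
For reference, before that commit the relevant definitions looked
approximately like this (paraphrased from the old include/linux/gfp.h
and mm/page_alloc.c, not a verbatim quote):

	/* include/linux/gfp.h, NUMA build */
	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

	/* mm/page_alloc.c, __alloc_pages_slowpath() */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;	/* bail out even if __GFP_WAIT is set */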

Commit 4167e9b2cf10f zapped the GFP_THISNODE combination. Instead of
OR-ing the gfp mask with GFP_THISNODE, the gfp_exact_node() helper
should now be used. The latter sets the __GFP_THISNODE and __GFP_NOWARN
flags and clears __GFP_WAIT on the current gfp mask. As a result, it
effectively prohibits invoking memcg reclaim during steps 1 and 2. This
breaks the memcg/memory.high logic when kmem accounting is enabled. The
memory.high threshold is supposed to work as a soft limit, i.e. it does
not fail an allocation on breaching it, but it still forces the caller
to invoke direct reclaim to compensate for the excess. Without the
__GFP_WAIT flag, direct reclaim is impossible, so the caller proceeds
without being pushed back to the threshold.
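
To illustrate, the memcg charge path behaves roughly as follows (an
illustrative sketch with simplified names and placement, not the
verbatim mm/memcontrol.c code):

	/* sketch of try_charge(): memory.high never fails the charge */
	page_counter_charge(&memcg->memory, nr_pages);
	if (page_counter_read(&memcg->memory) > memcg->high) {
		if (!(gfp_mask & __GFP_WAIT))
			return 0;	/* cannot reclaim, excess stays */
		/* direct reclaim pushes the cgroup back to memory.high */
		try_to_free_mem_cgroup_pages(memcg, nr_pages,
					     gfp_mask, true);
	}
	return 0;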

To fix this issue, get rid of the gfp_exact_node() helper and move gfp
flags filtering into kmem_getpages(), after memcg_charge_slab() has
been called (see the condensed sketch following the notes below).

To understand the patch, note that:
 - In fallback_alloc() the only effect of using gfp_exact_node() is
   preventing recursion fallback_alloc() -> ____cache_alloc_node() ->
   fallback_alloc().
 - Aside from fallback_alloc(), gfp_exact_node() is only used along with
   cache_grow(). Moreover, the only place where cache_grow() is used
   without it is fallback_alloc(), which, in contrast to other
   cache_grow() users, preallocates a page and passes it to cache_grow()
   so that the latter does not need to invoke kmem_getpages() by itself.
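
With the patch applied, the flag filtering happens inside
kmem_getpages() itself, after the memcg charge (condensed from the
diff below; the failure/uncharge path is omitted):

	static struct page *kmem_getpages(struct kmem_cache *cachep,
					  gfp_t flags, int nodeid,
					  bool fallback)
	{
		/* memcg reclaim may run here: flags still carry __GFP_WAIT */
		if (memcg_charge_slab(cachep, flags, cachep->gfporder))
			return NULL;

		/* only now pin the allocation to 'nodeid' without reclaim */
		if (!fallback)
			flags = (flags | __GFP_THISNODE | __GFP_NOWARN) &
				~__GFP_WAIT;

		return __alloc_pages_node(nodeid, flags | __GFP_NOTRACK,
					  cachep->gfporder);
	}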

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c | 32 +++++++++++---------------------
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index d890750ec31e..9ee809d2ed8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,11 +857,6 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep,
 	return NULL;
 }
 
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
-	return flags;
-}
-
 #else	/* CONFIG_NUMA */
 
 static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
@@ -1028,15 +1023,6 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 
 	return __cache_free_alien(cachep, objp, node, page_node);
 }
-
-/*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
- */
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
-	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
-}
 #endif
 
 /*
@@ -1583,7 +1569,7 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
  * would be relatively rare and ignorable.
  */
 static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
-								int nodeid)
+				  int nodeid, bool fallback)
 {
 	struct page *page;
 	int nr_pages;
@@ -1595,6 +1581,9 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	if (memcg_charge_slab(cachep, flags, cachep->gfporder))
 		return NULL;
 
+	if (!fallback)
+		flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+
 	page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
 	if (!page) {
 		memcg_uncharge_slab(cachep, cachep->gfporder);
@@ -2641,7 +2630,8 @@ static int cache_grow(struct kmem_cache *cachep,
 	 * 'nodeid'.
 	 */
 	if (!page)
-		page = kmem_getpages(cachep, local_flags, nodeid);
+		page = kmem_getpages(cachep, local_flags, nodeid,
+				     !IS_ENABLED(CONFIG_NUMA));
 	if (!page)
 		goto failed;
 
@@ -2840,7 +2830,7 @@ alloc_done:
 	if (unlikely(!ac->avail)) {
 		int x;
 force_grow:
-		x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+		x = cache_grow(cachep, flags, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
@@ -3034,7 +3024,7 @@ retry:
 			get_node(cache, nid) &&
 			get_node(cache, nid)->free_objects) {
 				obj = ____cache_alloc_node(cache,
-					gfp_exact_node(flags), nid);
+					flags | __GFP_THISNODE, nid);
 				if (obj)
 					break;
 		}
@@ -3052,7 +3042,7 @@ retry:
 		if (local_flags & __GFP_WAIT)
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
-		page = kmem_getpages(cache, local_flags, numa_mem_id());
+		page = kmem_getpages(cache, local_flags, numa_mem_id(), true);
 		if (local_flags & __GFP_WAIT)
 			local_irq_disable();
 		if (page) {
@@ -3062,7 +3052,7 @@ retry:
 			nid = page_to_nid(page);
 			if (cache_grow(cache, flags, nid, page)) {
 				obj = ____cache_alloc_node(cache,
-					gfp_exact_node(flags), nid);
+					flags | __GFP_THISNODE, nid);
 				if (!obj)
 					/*
 					 * Another processor may allocate the
@@ -3133,7 +3123,7 @@ retry:
 
 must_grow:
 	spin_unlock(&n->list_lock);
-	x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
+	x = cache_grow(cachep, flags, nodeid, NULL);
 	if (x)
 		goto retry;
 
-- 
2.1.4
