[PATCH 01/40] mm: page allocation rank

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, netdev@vger.kernel.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Thomas Graf <tgraf@suug.ch>, David Miller <davem@davemloft.net>,
	James Bottomley <James.Bottomley@SteelEye.com>,
	Mike Christie <michaelc@cs.wisc.edu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Daniel Phillips <phillips@google.com>
Subject: [PATCH 01/40] mm: page allocation rank
Date: Fri, 04 May 2007 12:26:52 +0200	[thread overview]
Message-ID: <20070504103155.582370247@chello.nl> (raw)
In-Reply-To: <20070504102651.923946304@chello.nl>

[-- Attachment #1: mm-page_alloc-rank.patch --]
[-- Type: text/plain, Size: 7975 bytes --]

Introduce page allocation rank.

This allocation rank is an measure of the 'hardness' of the page allocation.
Where hardness refers to how deep we have to reach (and thereby if reclaim 
was activated) to obtain the page.

It basically is a mapping from the ALLOC_/gfp flags into a scalar quantity,
which allows for comparisons of the kind: 
  'would this allocation have succeeded using these gfp flags'.

For the gfp -> alloc_flags mapping we use the 'hardest' possible, those
used by __alloc_pages() right before going into direct reclaim.

The alloc_flags -> rank mapping is given by: 2*2^wmark - harder - 2*high
where wmark = { min = 1, low, high } and harder, high are booleans.
This gives:
  0 is the hardest possible allocation - ALLOC_NO_WATERMARK,
  1 is ALLOC_WMARK_MIN|ALLOC_HARDER|ALLOC_HIGH,
  ...
  15 is ALLOC_WMARK_HIGH|ALLOC_HARDER,
  16 is the softest allocation - ALLOC_WMARK_HIGH.

Rank <= 4 will have woke up kswapd and when also > 0 might have ran into
direct reclaim.

Rank > 8 rarely happens and means lots of memory free (due to parallel oom kill).

The allocation rank is stored in page->index for successful allocations.

'offline' testing of the rank is made impossible by direct reclaim and
fragmentation issues. That is, it is impossible to tell if a given allocation
will succeed without actually doing it.

The purpose of this measure is to introduce some fairness into the slab
allocator.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/internal.h   |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   58 +++++++++++++---------------------------------
 2 files changed, 87 insertions(+), 41 deletions(-)

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h	2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/internal.h	2007-02-22 14:08:41.000000000 +0100
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/hardirq.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,73 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN		0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW		0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH	0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS	0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+
+/*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+static int inline gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+	struct task_struct *p = current;
+	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	/*
+	 * The caller may dip into page reserves a bit more if the caller
+	 * cannot run direct reclaim, or if the caller has realtime scheduling
+	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+	 */
+	if (gfp_mask & __GFP_HIGH)
+		alloc_flags |= ALLOC_HIGH;
+
+	if (!wait) {
+		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+		 */
+		alloc_flags &= ~ALLOC_CPUSET;
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
+		alloc_flags |= ALLOC_HARDER;
+
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (!in_interrupt() &&
+		    ((p->flags & PF_MEMALLOC) ||
+		     unlikely(test_thread_flag(TIF_MEMDIE))))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+	}
+
+	return alloc_flags;
+}
+
+#define MAX_ALLOC_RANK	16
+
+/*
+ * classify the allocation: 0 is hardest, 16 is easiest.
+ */
+static inline int alloc_flags_to_rank(int alloc_flags)
+{
+	int rank;
+
+	if (alloc_flags & ALLOC_NO_WATERMARKS)
+		return 0;
+
+	rank = alloc_flags & (ALLOC_WMARK_MIN|ALLOC_WMARK_LOW|ALLOC_WMARK_HIGH);
+	rank -= alloc_flags & (ALLOC_HARDER|ALLOC_HIGH);
+
+	return rank;
+}
+
+static inline int gfp_to_rank(gfp_t gfp_mask)
+{
+	return alloc_flags_to_rank(gfp_to_alloc_flags(gfp_mask));
+}
+
 #endif
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c	2007-02-22 13:56:00.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c	2007-02-22 14:08:41.000000000 +0100
@@ -892,14 +892,6 @@ failed:
 	return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS	0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN		0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW		0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH	0x08 /* use pages_high watermark */
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1190,6 +1182,7 @@ zonelist_scan:
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
 		if (page)
+			page->index = alloc_flags_to_rank(alloc_flags);
 			break;
 this_zone_full:
 		if (NUMA_BUILD)
@@ -1263,48 +1256,27 @@ restart:
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
 	 * to how we want to proceed.
-	 *
-	 * The caller may dip into page reserves a bit more if the caller
-	 * cannot run direct reclaim, or if the caller has realtime scheduling
-	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 	 */
-	alloc_flags = ALLOC_WMARK_MIN;
-	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
-		alloc_flags |= ALLOC_HARDER;
-	if (gfp_mask & __GFP_HIGH)
-		alloc_flags |= ALLOC_HIGH;
-	if (wait)
-		alloc_flags |= ALLOC_CPUSET;
+	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Go through the zonelist again. Let __GFP_HIGH and allocations
-	 * coming from realtime tasks go deeper into reserves.
-	 *
-	 * This is the last chance, in general, before the goto nopage.
-	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	/* This is the last chance, in general, before the goto nopage. */
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+			alloc_flags & ~ALLOC_NO_WATERMARKS);
 	if (page)
 		goto got_pg;
 
 	/* This allocation should allow future memory freeing. */
-
 rebalance:
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
-			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+		/* go through the zonelist yet again, ignoring mins */
+		page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);
-			if (page)
-				goto got_pg;
-			if (gfp_mask & __GFP_NOFAIL) {
-				congestion_wait(WRITE, HZ/50);
-				goto nofail_alloc;
-			}
+		if (page)
+			goto got_pg;
+		if (wait && (gfp_mask & __GFP_NOFAIL)) {
+			congestion_wait(WRITE, HZ/50);
+			goto nofail_alloc;
 		}
 		goto nopage;
 	}
@@ -1313,6 +1285,10 @@ nofail_alloc:
 	if (!wait)
 		goto nopage;
 
+	/* Avoid recursion of direct reclaim */
+	if (p->flags & PF_MEMALLOC)
+		goto nopage;
+
 	cond_resched();
 
 	/* We now go into synchronous reclaim */

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2007-05-04 10:26 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-04 10:26 [PATCH 00/40] Swap over Networked storage -v12 Peter Zijlstra
2007-05-04 10:26 ` Peter Zijlstra [this message]
2007-05-04 10:26 ` [PATCH 02/40] mm: slab allocation fairness Peter Zijlstra
2007-05-16 20:41   ` Christoph Lameter
2007-05-04 10:26 ` [PATCH 03/40] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2007-05-04 10:26 ` [PATCH 04/40] mm: serialize access to min_free_kbytes Peter Zijlstra
2007-05-04 10:26 ` [PATCH 05/40] mm: emergency pool Peter Zijlstra
2007-05-04 10:26 ` [PATCH 06/40] mm: __GFP_EMERGENCY Peter Zijlstra
2007-05-04 10:26 ` [PATCH 07/40] mm: allow mempool to fall back to memalloc reserves Peter Zijlstra
2007-05-04 10:26 ` [PATCH 08/40] mm: kmem_cache_objsize Peter Zijlstra
2007-05-04 10:54   ` Pekka Enberg
2007-05-04 16:09     ` Christoph Lameter
2007-05-04 16:15       ` Peter Zijlstra
2007-05-04 16:23         ` Christoph Lameter
2007-05-04 16:30           ` Peter Zijlstra
2007-05-04 16:36   ` Christoph Lameter
2007-05-04 17:59     ` Peter Zijlstra
2007-05-04 18:04       ` Christoph Lameter
2007-05-04 18:21         ` Peter Zijlstra
2007-05-04 18:30           ` Christoph Lameter
2007-05-04 18:32             ` Peter Zijlstra
2007-05-04 18:45               ` Pekka Enberg
2007-05-04 18:47                 ` Christoph Lameter
2007-05-04 18:54                   ` Pekka Enberg
2007-05-04 19:59                     ` Christoph Lameter
2007-05-05  9:00                       ` Pekka J Enberg
2007-05-04 18:41             ` Pekka Enberg
2007-05-04 18:46               ` Christoph Lameter
2007-05-04 18:53                 ` Pekka Enberg
2007-05-04 19:58                   ` Christoph Lameter
2007-05-04 10:27 ` [PATCH 09/40] mm: optimize gfp_to_rank() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 10/40] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2007-05-04 10:27 ` [PATCH 11/40] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 12/40] net: packet split receive api Peter Zijlstra
2007-05-04 10:27 ` [PATCH 13/40] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2007-05-04 10:27 ` [PATCH 14/40] netvm: link network to vm layer Peter Zijlstra
2007-05-04 10:27 ` [PATCH 15/40] netvm: INET reserves Peter Zijlstra
2007-05-04 10:27 ` [PATCH 16/40] netvm: hook skb allocation to reserves Peter Zijlstra
2007-05-04 14:07   ` Arnaldo Carvalho de Melo
2007-05-04 10:27 ` [PATCH 17/40] netvm: filter emergency skbs Peter Zijlstra
2007-05-04 10:27 ` [PATCH 18/40] netvm: prevent a TCP specific deadlock Peter Zijlstra
2007-05-04 10:27 ` [PATCH 19/40] netfilter: notify about NF_QUEUE vs emergency skbs Peter Zijlstra
2007-05-04 10:27 ` [PATCH 20/40] netvm: skb processing Peter Zijlstra
2007-05-04 10:27 ` [PATCH 21/40] uml: rename arch/um remove_mapping() Peter Zijlstra
2007-05-04 10:27 ` [PATCH 22/40] mm: prepare swap entry methods for use in page methods Peter Zijlstra
2007-05-04 10:27 ` [PATCH 23/40] mm: add support for non block device backed swap files Peter Zijlstra
2007-05-04 10:27 ` [PATCH 24/40] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2007-05-04 10:27 ` [PATCH 25/40] nfs: remove mempools Peter Zijlstra
2007-05-04 10:27 ` [PATCH 26/40] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2007-05-04 10:27 ` [PATCH 27/40] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2007-05-04 10:27 ` [PATCH 28/40] nfs: enable swap on NFS Peter Zijlstra
2007-05-04 10:27 ` [PATCH 29/40] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2007-05-04 10:27 ` [PATCH 30/40] nfs: fixup missing error code Peter Zijlstra
2007-05-04 13:10   ` Peter Staubach
2007-05-04 13:18     ` Peter Zijlstra
2007-05-04 10:27 ` [PATCH 31/40] mm: balance_dirty_pages() vs throttle_vm_writeout() deadlock Peter Zijlstra
2007-05-04 10:27 ` [PATCH 32/40] block: add a swapdev callback to the request_queue Peter Zijlstra
2007-05-04 10:27 ` [PATCH 33/40] uml: enable scsi and add iscsi config Peter Zijlstra
2007-05-04 10:27 ` [PATCH 34/40] sock: safely expose kernel sockets to userspace Peter Zijlstra
2007-05-04 10:27 ` [PATCH 35/40] From: Mike Christie <mchristi@redhat.com> Peter Zijlstra
2007-05-04 10:27 ` [PATCH 36/40] iscsi: fixup of the ep_connect patch Peter Zijlstra
2007-05-04 10:27 ` [PATCH 37/40] iscsi: ensure the iscsi kernel fd is not usable in userspace Peter Zijlstra
2007-05-04 10:27 ` [PATCH 38/40] netlink: add SOCK_VMIO support to AF_NETLINK Peter Zijlstra
2007-05-04 10:27 ` [PATCH 39/40] mm: a process flags to avoid blocking allocations Peter Zijlstra
2007-05-04 10:27 ` [PATCH 40/40] iscsi: support for swapping over iSCSI Peter Zijlstra
2007-05-04 15:22 ` [PATCH 00/40] Swap over Networked storage -v12 Daniel Walker
2007-05-04 15:38   ` Peter Zijlstra
2007-05-04 15:59     ` Daniel Walker
2007-05-04 18:09       ` Mike Snitzer
2007-05-04 19:31         ` Daniel Walker
2007-05-04 19:54         ` David Miller, Mike Snitzer
2007-05-04 21:36   ` Arnaldo Carvalho de Melo
2007-05-04 19:27 ` David Miller, Peter Zijlstra
2007-05-04 19:41   ` Peter Zijlstra
2007-05-04 20:02     ` David Miller, Peter Zijlstra
2007-05-04 20:29       ` Jeff Garzik
2007-05-05  9:43   ` Christoph Hellwig
2007-05-05  9:55     ` William Lee Irwin III

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070504103155.582370247@chello.nl \
    --to=a.p.zijlstra@chello.nl \
    --cc=James.Bottomley@SteelEye.com \
    --cc=akpm@linux-foundation.org \
    --cc=davem@davemloft.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=michaelc@cs.wisc.edu \
    --cc=netdev@vger.kernel.org \
    --cc=phillips@google.com \
    --cc=tgraf@suug.ch \
    --cc=trond.myklebust@fys.uio.no \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox