Re: [PATCH 00/16] Swap-over-NBD without deadlocking V9

From: Eric B Munson <emunson@mgebm.net>
To: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux-Netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	David Miller <davem@davemloft.net>, Neil Brown <neilb@suse.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Mike Christie <michaelc@cs.wisc.edu>
Subject: Re: [PATCH 00/16] Swap-over-NBD without deadlocking V9
Date: Sat, 21 Apr 2012 14:15:41 -0400
Message-ID: <20120421181541.GC17039@mgebm.net>
In-Reply-To: <1334578624-23257-1-git-send-email-mgorman@suse.de>

On Mon, 16 Apr 2012, Mel Gorman wrote:

> Changelog since V8
>   o Rebase to 3.4-rc2
>   o Use page flag instead of slab fields to keep structures the same size
>   o Properly detect allocations from softirq context that use PF_MEMALLOC
>   o Ensure kswapd does not sleep while processes are throttled
>   o Do not accidentally throttle !__GFP_FS processes indefinitely
> 
> Changelog since V7
>   o Rebase to 3.3-rc2
>   o Take greater care propagating page->pfmemalloc to skb
>   o Propagate pfmemalloc from netdev_alloc_page to skb where possible
>   o Release RCU lock properly on preempt kernel
> 
> Changelog since V6
>   o Rebase to 3.1-rc8
>   o Use wake_up instead of wake_up_interruptible()
>   o Do not throttle kernel threads
>   o Avoid a potential race between kswapd going to sleep and processes being
>     throttled
> 
> Changelog since V5
>   o Rebase to 3.1-rc5
> 
> Changelog since V4
>   o Update comment clarifying what protocols can be used		(Michal)
>   o Rebase to 3.0-rc3
> 
> Changelog since V3
>   o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
>   o Rebase to 3.0-rc2
> 
> Changelog since V2
>   o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
>   o Use wait_event_interruptible					(Neil)
>   o Use !! when casting to bool to avoid any possibility of type
>     truncation								(Neil)
>   o Nicer logic when using skb_pfmemalloc_protocol			(Neil)
> 
> Changelog since V1
>   o Rebase on top of mmotm
>   o Use atomic_t for memalloc_socks		(David Miller)
>   o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
>   o Check throttle within prepare_to_wait	(Neil Brown)
>   o Add statistics on throttling instead of printk
> 
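As a rough illustration of the V2 note that __GFP_NOMEMALLOC overrides
__GFP_MEMALLOC, the precedence can be modelled in a few lines of userspace
Python. This is a sketch, not code from the series: the bit values and the
function name may_use_reserves() are made up for the example, and treating
the per-task flag the same way as the per-allocation flag is an assumption.

```python
# Hypothetical bit values for illustration only; the kernel's values differ.
GFP_MEMALLOC = 0x1    # stands in for __GFP_MEMALLOC
GFP_NOMEMALLOC = 0x2  # stands in for __GFP_NOMEMALLOC


def may_use_reserves(gfp_mask: int, task_pf_memalloc: bool = False) -> bool:
    """Return True if this allocation may dip into the emergency reserves."""
    # The "no reserves" flag wins over every way of asking for them.
    if gfp_mask & GFP_NOMEMALLOC:
        return False
    # Either the per-allocation flag or the per-task flag grants access.
    # bool() plays the role of !! in C: any non-zero bit collapses to True.
    return bool(gfp_mask & GFP_MEMALLOC) or task_pf_memalloc
```

With both flags set the answer is "no", which is the documented override;
with only the MEMALLOC stand-in set the answer is "yes".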
> When a user or administrator requires swap for their application, they
> create a swap partition or file, format it with mkswap and activate it
> with swapon. Swap over the network is considered as an option in diskless
> systems. The two likely scenarios are blade servers used as part of a
> cluster, where the form factor or maintenance costs do not allow the
> use of disks, and thin clients.
> 
> The Linux Terminal Server Project recommends the use of the
> Network Block Device (NBD) for swap according to the manual at
> https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
> There are also documentation and tutorials on how to set up swap over NBD
> at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
> The nbd-client also documents the use of NBD as swap. Despite this, the
> fact is that a machine using NBD for swap can deadlock within minutes if
> swap is used intensively. This patch series addresses the problem.
> 
> The core issue is that network block devices do not use mempools like
> normal block devices do. As the host cannot control where it receives
> packets from, it cannot reliably work out in advance how much memory
> it might need. Some years ago, Peter Zijlstra developed a series of
> patches that supported swap over NFS, which at least one distribution
> is carrying within its kernels. This patch series borrows very heavily
> from Peter's work to support swapping over NBD as a pre-requisite to
> supporting swap-over-NFS. The bulk of the complexity is concerned with
> preserving memory that is allocated from the PFMEMALLOC reserves for use
> by the network layer, which is needed for both NBD and NFS.
> 
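To make the mempool point concrete, here is a toy userspace model of what a
mempool provides: a reserve of buffers sized in advance that is used only
when the normal allocator fails. It is a sketch under assumptions, not the
kernel's mempool API; ReservePool and its methods are hypothetical names.
The reserve size has to be chosen up front, which is the step a network
block device cannot do reliably.

```python
class ReservePool:
    """Toy pool that pre-allocates min_nr buffers as a guaranteed reserve."""

    def __init__(self, min_nr, buf_size):
        self.min_nr = min_nr
        self.buf_size = buf_size
        # The reserve is filled once, before any memory pressure exists.
        self.reserve = [bytearray(buf_size) for _ in range(min_nr)]

    def alloc(self, allocator):
        """Try the normal allocator first, then fall back to the reserve."""
        buf = allocator(self.buf_size)
        if buf is not None:
            return buf
        if self.reserve:
            return self.reserve.pop()
        # A real pool would wait here until some buffer is freed.
        return None

    def free(self, buf):
        """Refill the reserve before handing memory back to the system."""
        if len(self.reserve) < self.min_nr:
            self.reserve.append(buf)
```

A caller that never needs more than min_nr buffers in flight always makes
progress, even when the allocator passed to alloc() keeps returning None.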
> Patch 1 serialises access to min_free_kbytes. It's not strictly needed
> 	by this series but as the series cares about watermarks in
> 	general, it's a harmless fix. It could be merged independently
> 	and may be if CMA is merged in advance.
> 
> Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
> 	preserve access to pages allocated under low memory situations
> 	to callers that are freeing memory.
> 
> Patch 3 optimises the SLUB fast path to avoid pfmemalloc checks.
> 
> Patch 4 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
> 	reserves without setting PF_MEMALLOC.
> 
> Patch 5 opens the possibility for softirqs to use PFMEMALLOC reserves
> 	for later use by network packet processing.
> 
> Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
> 
> Patches 7-12 allow network processing to use PFMEMALLOC reserves when
> 	the socket has been marked as being used by the VM to clean pages. If
> 	packets are received and stored in pages that were allocated under
> 	low-memory situations and are unrelated to the VM, the packets
> 	are dropped.
> 
> 	Patch 11 reintroduces __netdev_alloc_page which the networking
> 	folk may object to but is needed in some cases to propagate
> 	pfmemalloc from a newly allocated page to an skb. If there is a
> 	strong objection, this patch can be dropped with the impact being
> 	that swap-over-network will be slower in some cases but it should
> 	not fail.
> 
> Patch 13 is a micro-optimisation to avoid a function call in the
> 	common case.
> 
> Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
> 	PFMEMALLOC if necessary.
> 
> Patch 15 notes that it is still possible for the PFMEMALLOC reserve
> 	to be depleted. To prevent this, direct reclaimers get throttled on
> 	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
> 	expected that kswapd and the direct reclaimers already running
> 	will clean enough pages for the low watermark to be reached and
> 	for the throttled processes to be woken up.
> 
> Patch 16 adds a statistic to track how often processes get throttled
> 
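The throttling rule described for patch 15 can be sketched as a simple
predicate. This is a userspace model, not the patch itself: the function
names are hypothetical, "reserve_pages" stands for the size of the
PFMEMALLOC reserve, and exempting kernel threads follows the V6 changelog
entry above.

```python
def reserves_ok(free_pages: int, reserve_pages: int) -> bool:
    """True while more than half of the reserve is still available."""
    # Integer comparison avoids rounding when reserve_pages is odd.
    return free_pages * 2 > reserve_pages


def should_throttle(free_pages: int, reserve_pages: int,
                    is_kernel_thread: bool = False) -> bool:
    """Decide whether a direct reclaimer should wait on the waitqueue."""
    # Kernel threads are never throttled (assumption taken from the
    # "Do not throttle kernel threads" changelog entry).
    if is_kernel_thread:
        return False
    return not reserves_ok(free_pages, reserve_pages)
```

A throttled process would then sleep until reserves_ok() holds again, which
the cover letter expects kswapd and the already-running direct reclaimers to
bring about.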
> Some basic performance testing was run using kernel builds, netperf
> on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
> and sysbench. Each of them was expected to use the sl*b allocators
> reasonably heavily but there did not appear to be significant
> performance variances.
> 
> For testing swap-over-NBD, a machine was booted with 2G of RAM with a
> swapfile backed by NBD. 8*NUM_CPU processes were started that create
> anonymous memory mappings and read them linearly in a loop. The total
> size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
> memory pressure.
> 
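For anyone wanting to reproduce something similar, the workload above can be
approximated roughly as follows. This is only a sketch: the helper names are
mine, and splitting the 4*PHYSICAL_MEMORY total evenly across the
8*NUM_CPU processes is an assumption, since the description gives only the
totals. Each worker would call read_linearly() on its share in a loop.

```python
import mmap


def per_process_bytes(physical_bytes: int, num_cpus: int) -> int:
    """Size of each worker's mapping, assuming an even split."""
    nr_procs = 8 * num_cpus          # 8*NUM_CPU processes
    total = 4 * physical_bytes       # 4*PHYSICAL_MEMORY in total
    return total // nr_procs


def read_linearly(size: int, passes: int = 1) -> int:
    """Map anonymous memory and read one byte per page, front to back."""
    touched = 0
    with mmap.mmap(-1, size) as mapping:   # -1 requests an anonymous mapping
        for _ in range(passes):
            for offset in range(0, size, mmap.PAGESIZE):
                mapping[offset]            # the read faults the page in
                touched += 1
    return touched
```

On the 2G test machine with, say, 4 CPUs this works out to 32 workers of
256M each.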
> Using SLUB, the machine locks up within minutes without the patches and
> runs to completion with them applied. With SLAB, the story is different
> as an unpatched kernel runs to completion. However, the patched kernel
> completed the test 40% faster.
> 
>                                         3.4.0-rc2     3.4.0-rc2
>                                      vanilla-slab       swapnbd
> Sys Time Running Test (seconds)             87.90         73.45
> User+Sys Time Running Test (seconds)        91.93         76.91
> Total Elapsed Time (seconds)              4174.37       2953.96
> 
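A quick check of the SLAB figures, using only the numbers in the table: the
"40% faster" reads as a rate, since the elapsed-time ratio is roughly 1.41,
while the wall-clock time itself drops by about 29%. The helper names below
are just for the example.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' run is (ratio of times)."""
    return before / after


def time_saved_pct(before: float, after: float) -> float:
    """Percentage of the original run time that was saved."""
    return 100.0 * (before - after) / before


# Total elapsed time in seconds: vanilla-slab vs. swapnbd.
elapsed_vanilla, elapsed_patched = 4174.37, 2953.96
print(f"speedup: {speedup(elapsed_vanilla, elapsed_patched):.2f}x")
print(f"time saved: {time_saved_pct(elapsed_vanilla, elapsed_patched):.1f}%")
```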

I have tested these patches with an artificial swap benchmark and with a large
project compile on a BeagleBoard.  They work great for me.  My tests only
exercised this set via swap over NFS, so the coverage probably wasn't very
thorough.

Tested-by: Eric B Munson <emunson@mgebm.net>

Thread overview: 29+ messages
2012-04-16 12:16 Mel Gorman
2012-04-16 12:16 ` [PATCH 01/16] mm: Serialize access to min_free_kbytes Mel Gorman
2012-04-23 23:50   ` David Rientjes
2012-04-16 12:16 ` [PATCH 02/16] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages Mel Gorman
2012-04-23 23:51   ` David Rientjes
2012-04-25 15:05     ` Mel Gorman
2012-04-16 12:16 ` [PATCH 03/16] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks Mel Gorman
2012-04-16 12:16 ` [PATCH 04/16] mm: Introduce __GFP_MEMALLOC to allow access to emergency reserves Mel Gorman
2012-04-16 12:16 ` [PATCH 05/16] mm: allow PF_MEMALLOC from softirq context Mel Gorman
2012-05-01 22:08   ` Andrew Morton
2012-05-02 16:24     ` Mel Gorman
2012-04-16 12:16 ` [PATCH 06/16] mm: Ignore mempolicies when using ALLOC_NO_WATERMARK Mel Gorman
2012-04-16 12:16 ` [PATCH 07/16] net: Introduce sk_allocation() to allow addition of GFP flags depending on the individual socket Mel Gorman
2012-04-16 12:16 ` [PATCH 08/16] netvm: Allow the use of __GFP_MEMALLOC by specific sockets Mel Gorman
2012-04-16 12:16 ` [PATCH 09/16] netvm: Allow skb allocation to use PFMEMALLOC reserves Mel Gorman
2012-04-16 12:16 ` [PATCH 10/16] netvm: Propagate page->pfmemalloc to skb Mel Gorman
2012-04-16 12:16 ` [PATCH 11/16] netvm: Propagate page->pfmemalloc from netdev_alloc_page " Mel Gorman
2012-04-16 12:16 ` [PATCH 12/16] netvm: Set PF_MEMALLOC as appropriate during SKB processing Mel Gorman
2012-04-16 12:17 ` [PATCH 13/16] mm: Micro-optimise slab to avoid a function call Mel Gorman
2012-04-16 12:17 ` [PATCH 14/16] nbd: Set SOCK_MEMALLOC for access to PFMEMALLOC reserves Mel Gorman
2012-04-16 12:17 ` [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage Mel Gorman
2012-05-01 22:24   ` Andrew Morton
2012-05-02 16:24     ` Mel Gorman
2012-04-16 12:17 ` [PATCH 16/16] mm: Account for the number of times direct reclaimers get throttled Mel Gorman
2012-04-21 18:15 ` Eric B Munson [this message]
2012-05-01 22:28 ` [PATCH 00/16] Swap-over-NBD without deadlocking V9 Andrew Morton
2012-05-03 15:00   ` Mel Gorman
2012-05-03 17:06     ` David Miller
2012-05-04 10:16       ` Mel Gorman
