From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Christoph Lameter <clameter@sgi.com>
Cc: Daniel Phillips <phillips@phunq.net>,
	Matt Mackall <mpm@selenic.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	David Miller <davem@davemloft.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Daniel Phillips <phillips@google.com>,
	Pekka Enberg <penberg@cs.helsinki.fi>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Steve Dickson <SteveD@redhat.com>
Subject: Re: [PATCH 02/10] mm: system wide ALLOC_NO_WATERMARK
Date: Wed, 08 Aug 2007 09:24:25 +0200
Message-ID: <1186557865.7182.86.camel@twins>
In-Reply-To: <Pine.LNX.4.64.0708071513290.3683@schroedinger.engr.sgi.com>

On Tue, 2007-08-07 at 15:18 -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Daniel Phillips wrote:
> 
> > > AFAICT: This patchset is not throttling processes but failing
> > > allocations.
> > 
> > Failing allocations?  Where do you see that?  As far as I can see, 
> > Peter's patch set allows allocations to fail exactly where the user has 
> > always specified they may fail, and in no new places.  If there is a 
> > flaw in that logic, please let us know.
> 
> See the code added to slub: Allocations are satisfied from the reserve 
> patch or they are failing.

Allocations are satisfied from the reserve iff the allocation context is
entitled to the reserve; otherwise the allocator will try to allocate a
new slab. That is exactly like any other low-memory situation:
allocations do fail under pressure, but no more so with this patch.

> > > The patchset does not reconfigure the memory reserves as 
> > > expected.
> > 
> > What do you mean by that?  Expected by who?
> 
> What would be expected it some recalculation of min_freekbytes?

And I have been telling you:

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;

so the reserve is 3/8 of min_free_kbytes; a fixed limit is required.
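
To make the arithmetic concrete, here is a minimal user-space sketch of
the two reductions quoted above. The flag values and the function name
effective_min() are made up for illustration and are not the kernel's
definitions; only the two `min -= ...` statements come from the snippet.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define ALLOC_HARDER 0x10u
#define ALLOC_HIGH   0x20u

/*
 * Apply the two reductions from the quoted snippet to a base
 * watermark 'min' (think of it as a page count).
 */
static long effective_min(long min, unsigned int alloc_flags)
{
	assert(min >= 0);

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;		/* 1/2 of the base remains */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;		/* 3/4 of what was left remains */

	return min;
}
```

With both flags set and a base of 1024, the value goes 1024 -> 512 -> 384,
and 384/1024 = 3/8, which is the fraction stated above.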


> > > Code is added that is supposedly not used.
> > 
> > What makes you think that?
> 
> Because the argument is that performance does not matter since the code 
> patchs are not used.

The argument is that once you hit these code paths, you don't much care
for performance, not the other way around.

> > > And I suspect that we have the same issues as in earlier releases
> > > with various corner cases not being covered.
> > 
> > Do you have an example?
> 
> Try NUMA constraints and zone limitations.

How exactly are these failing? Breaking out of the policy boundaries in
the reserve path is no different from IRQ context allocations not
honouring them.

> > > If it  ever is on a large config then we are in very deep trouble by
> > > the new code paths themselves that serialize things in order to give
> > > some allocations precendence over the other allocations that are made
> > > to fail ....
> > 
> > You mean by allocating the reserve memory on the wrong node in NUMA?  
> 
> No I mean all 1024 processors of our system running into this fail/succeed 
> thingy that was added.

You mean 1024 CPUs all running into the kmem_cache-wide reserve slab?
Yeah, if that happens, it will hurt.

I just instrumented the kernel to find the allocations done under
PF_MEMALLOC for a heavy (non-swapping) reclaim load:

localhost ~ # cat /proc/ip_trace 
79	[<000000006001c140>] do_ubd_request+0x97/0x176
1	[<000000006006dfef>] allocate_slab+0x44/0x9a
165	[<00000000600e75d7>] new_handle+0x1d/0x47
1	[<00000000600ee897>] __jbd_kmalloc+0x18/0x1a
141	[<00000000600eeb2d>] journal_alloc_journal_head+0x15/0x6f
1	[<0000000060136907>] current_io_context+0x38/0x95
1	[<0000000060139634>] alloc_as_io_context+0x18/0x95

This is from about 15 minutes of load-5 thrashing file-backed reclaim.

So sadly I was mistaken: you will run into it, albeit, judging from the
numbers, not very frequently. (And here I thought that path was fully
covered by mempools.)

> > That is on a code path that avoids destroying your machine performance 
> > or killing the machine entirely as with current kernels, for which a 
> 
> As far as I know from our systems: The current kernels do not kill the 
> machine if the reserves are configured the right way.

AFAIK the current kernels do not allow for swap over NFS and a bunch of
other things.

The scenario I'm interested in is (two machines: client [A]/server [B]):

 1) networked machine [A] runs swap-over-net (NFS/iSCSI)
 2) someone trips over the NFS/iSCSI server's [B] ethernet cable
   / someone does /etc/init.d/nfs stop

 3) the machine [A] stops dead in its tracks in the middle of reclaim
 4) the machine [A] keeps on receiving unrelated network traffic for
    whatever other purpose the machine served

   - time passes -

 5) cable gets reconnected / nfs server started [B]
 6) client [A] succeeds with the reconnect and happily continues its
    swapping life.

Between 4) and 5) machine [A] needs to receive and discard an unspecified
amount of network traffic while user-space is basically frozen solid.

FWIW this scenario works here.

> > few cachelines pulled to another node is a small price to pay.  And you 
> > are free to use your special expertise in NUMA to make those fallback 
> > paths even more efficient, but first you need to understand what they 
> > are doing and why.
> 
> There is your problem. The justification is not clear at all and the 
> solution likely causes unrelated problems.

The situation I want to avoid arises during 3): the machine will slowly
but steadily freeze over as userspace allocations start to block on
outstanding swap I/O.

Since I need a fixed reserve to operate 4), I do not want these slowly
freezing processes to 'accidentally' gobble up slabs allocated from the
reserve for more important matters.

So what the modification to the slab allocator does is:

 - if there is a reserve slab:
   - if the context allows access to the reserves
     - serve object
   - otherwise try to allocate a new slab
     - if this succeeds
       - remove the reserve slab since clearly the pressure is gone
       - serve from the new slab
     - otherwise fail
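
The decision list above can be modelled as a small user-space sketch.
This is a toy, not the SLUB code from the patch: struct toy_cache,
toy_alloc() and the result names are all hypothetical, and the
new_slab_ok field merely stands in for "the page allocator can hand out
a fresh slab". The case with no reserve slab is not covered by the list,
so the model simply reports it as the normal path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; this only models the list above. */
enum toy_result {
	TOY_NORMAL_PATH,	/* no reserve slab: outside the list above */
	TOY_FROM_RESERVE,	/* object served from the reserve slab */
	TOY_FROM_NEW_SLAB,	/* new slab allocated, reserve slab dropped */
	TOY_FAILED,		/* not entitled and no new slab available */
};

struct toy_cache {
	bool has_reserve_slab;	/* cache currently holds a reserve slab */
	bool new_slab_ok;	/* stand-in: a fresh slab can be allocated */
};

static enum toy_result toy_alloc(struct toy_cache *c, bool may_use_reserve)
{
	assert(c != NULL);

	if (!c->has_reserve_slab)
		return TOY_NORMAL_PATH;

	if (may_use_reserve)
		return TOY_FROM_RESERVE;

	/* Not entitled to the reserve: try to allocate a new slab. */
	if (c->new_slab_ok) {
		/* Success means the pressure is gone; drop the reserve slab. */
		c->has_reserve_slab = false;
		return TOY_FROM_NEW_SLAB;
	}

	return TOY_FAILED;
}
```

In this model an unentitled caller never touches the reserve slab: it
either gets a freshly allocated slab or fails, which is the property the
list is meant to guarantee.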

Now, that 'fail' part scares you; note, however, that __GFP_WAIT
allocations will stick in the allocate part for a while, just like in
regular low-memory situations.


Anyway, I'm in a bit of a mess, having proven that even file-backed
reclaim hits these paths.

 1) I'm reluctant to add #ifdef in core allocation paths
 2) I'm reluctant to add yet another NR_CPUS array to struct kmem_cache

Christoph, does this all explain the situation?
