From: David Gibson <david@gibson.dropbear.id.au>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Ken Chen <kenneth.w.chen@intel.com>,
Bill Irwin <wli@holomorphy.com>, Adam Litke <agl@us.ibm.com>,
linux-mm@kvack.org
Subject: Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
Date: Wed, 25 Oct 2006 20:09:29 +1000
Message-ID: <20061025100929.GA11040@localhost.localdomain>
In-Reply-To: <Pine.LNX.4.64.0610250841250.8576@blonde.wat.veritas.com>

On Wed, Oct 25, 2006 at 09:39:10AM +0100, Hugh Dickins wrote:
> On Wed, 25 Oct 2006, David Gibson wrote:
> > On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> > > If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> > > area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate
> > > my preliminary i_size check before attempting to allocate the page (though
> > > this only fixes the most obvious case: more work will be needed here).
> > >
> > > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > > ---
> > >
> > > This is not a complete solution (what if hugetlb_no_page is actually
> > > racing with truncate_hugepages?), and there are several other accounting
> > > anomalies in here (private versus shared pages, hugetlbfs quota handling);
> > > but those all need more thought. It'll probably make sense to use i_mutex
> > > instead of hugetlb_instantiation_mutex, so locking out truncation
> > > and mmap.
> >
> > Ah, yes. I also encountered this one a few days ago - I found it in
> > the context of deserializing the hugepage fault path, which makes the
> > problem worse, and forgot to consider if there was also a problem in
> > the original case.
> >
> > In fact, there's a second problem with the current location of the
> > i_size check. As well as wrapping the reserved count, if there's a
> > fault on a truncated area and the hugepage pool is also empty, we can
> > get an OOM SIGKILL instead of the correct SIGBUS.
>
> That's exactly why I put in the preliminary i_size check originally,
> which you guys then decided was unnecessary.
We did? That was silly of us..
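For the record, the check itself is tiny. A sketch of its shape, from
memory of Hugh's patch rather than the exact diff, so treat the
details as approximate:

	/* Early in hugetlb_no_page(), before allocating anything or
	 * adding it to the page cache: a fault wholly beyond EOF
	 * should just SIGBUS, rather than wrapping resv_huge_pages
	 * or OOM-killing when the pool happens to be empty. */
	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
	if (idx >= size)
		goto out;	/* backout path returns VM_FAULT_SIGBUS */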
> But it wasn't worth
> arguing over which particular error manifested in that case.
> > I don't think things are quite as bad as you fear, though: I believe
> > the page lock protects us against racing concurrent truncations (this
> > is one reason we have find_lock_page() here, rather than the
> > find_get_page() which appears in the analogous normal page path).
>
> The page lock protects once you've got it. But when getting a new
> page, there's a sequence of operations before getting the page lock,
> where a racing truncate may occur, and then we may add a page to cache
> beyond the i_size. And not remove it until the file is retruncated
> or deleted, or the mount unmounted: not the worst of leaks, but not
> what we want either.
Ah. Yes. Indeed.
Possibly we need to duplicate the truncate_count logic from the normal
page path.
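The pattern I mean, from do_no_page(), transplanted here roughly (the
alloc helper name is made up, and error handling is elided):

	struct page *page;
	unsigned int sequence;

retry:
	/* Sample the truncate generation before the lookup; the
	 * smp_rmb() pairs with the writer in the truncate path. */
	sequence = mapping->truncate_count;
	smp_rmb();

	page = find_lock_page(mapping, idx);
	if (!page)
		page = alloc_and_add_hugepage(mapping, idx);

	/* A truncate may have run between the sample and taking the
	 * page lock: if so, back out rather than instantiating a pte
	 * on a page that may now lie beyond i_size. */
	if (unlikely(sequence != mapping->truncate_count)) {
		unlock_page(page);
		put_page(page);
		goto retry;
	}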
> > I suggest the slightly revised patch below, which doesn't duplicate
> > the i_size test, and cleans up the backout path (removing a
> > no-longer-useful goto label) in the process.
>
> I don't see the advantage of that version: I'd rather stick with my
> clearly-not-worse two-liner for 2.6.19, and work on fixing it all up
> properly later.
Yeah, I guess so.
> I was only trying to fix the prio_tree issue (which had led me to
> mislead Ken on his prio_tree use - thankfully he ignored me and stuck
> with what worked): but each time I tested something I found something
> else wrong. Right now I must switch away, so I'm posting the obvious bits.
>
> To expand a little on the other problems here: the hugetlb_get_quota
> and hugetlb_put_quota calls are suspect but hard to get right, and
> the resv_huge_pages-- in alloc_huge_page is a problem too, I think.
> The handling of cached versus private pages seems confused (private
> pages should not be counted out of the superblock quota).
>
> It would probably straighten out if hugetlb_no_page just dealt
> with page cache pages (no VM_SHARED versus not path), and left the
> private pages to hugetlb_cow; but that would be wasteful.
In some cases exceedingly wasteful. I don't think we can do this.
> And almost(?) all the backtracking could be taken out if i_mutex
> were held; hugetlbfs_file_mmap is already taking i_mutex within
> mmap_sem (contrary to usual mm lock ordering, but probably okay
> since hugetlbfs has no read/write, though lockdep may need teaching).
> Though serializing these faults at all is regrettable.
Um, yes. Especially since I was in the middle of attempting to
de-serialize it. Christoph Lameter has userspace code that does
hugepage initialization (mostly clearing) in parallel, which obviously
won't work with the serialization. I have a tentative patch to address
it, which replaces the hugetlb_instantiation_mutex with a table of
mutexes, hashed on address_space and offset (or struct mm and address
for MAP_PRIVATE). Originally I tried simply removing the mutex and
retrying the fault whenever we got an OOM but detected a race. After
several variants on two or three basic approaches, each of which
turned out to be less race-free than I first thought, I gave up and
went with the hashed mutexes. Either way, though, there will still be
i_size issues to sort out.
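A minimal sketch of what I have in mind, with illustrative names (the
patch isn't final, so don't hold me to these):

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define HUGETLB_FAULT_MUTEX_SHIFT	8
#define HUGETLB_FAULT_MUTEX_COUNT	(1UL << HUGETLB_FAULT_MUTEX_SHIFT)

/* One mutex per bucket, each mutex_init()ed at boot; a fault takes
 * only the bucket covering the page it is instantiating, so unrelated
 * faults no longer serialize against each other. */
static struct mutex hugetlb_fault_mutex_table[HUGETLB_FAULT_MUTEX_COUNT];

static unsigned long hugetlb_fault_mutex_hash(struct address_space *mapping,
					      pgoff_t idx,
					      struct mm_struct *mm,
					      unsigned long address)
{
	unsigned long key;

	if (mapping)	/* shared: key on the file and offset */
		key = (unsigned long)mapping + idx;
	else		/* MAP_PRIVATE: key on the mm and address */
		key = (unsigned long)mm + (address >> HPAGE_SHIFT);

	return hash_long(key, HUGETLB_FAULT_MUTEX_SHIFT);
}

The fault path then does mutex_lock() on the hashed entry where it
currently takes hugetlb_instantiation_mutex. Two faults can still
collide on a bucket, but only same-page faults are guaranteed to,
which is the serialization we actually need.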
I'm also not sure that i_mutex will work in any case. I'm pretty sure
I looked for a suitable existing lock before I created
hugetlb_instantiation_mutex. I'm not at all sure, but I have a nasty
suspicion there's some not-immediately-obvious lock-ordering
constraint that means we can't take i_mutex across the fault.
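The generic shape of the constraint I'm thinking of, purely as
illustration (Hugh's point that hugetlbfs has no read/write may well
make it moot):

/*
 *   sys_write() path:                  fault path under discussion:
 *     mutex_lock(&inode->i_mutex)        down_read(&mm->mmap_sem)
 *     copy_from_user() faults:           (instantiate the page)
 *       down_read(&mm->mmap_sem)         mutex_lock(&inode->i_mutex)
 *
 * i_mutex -> mmap_sem in one task, mmap_sem -> i_mutex in another:
 * a classic AB-BA deadlock. If there is truly no read/write path on
 * hugetlbfs the left-hand column can't happen, but lockdep would
 * still need the nesting annotated.
 */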
> Things may not be as bad as I fear, but I've certainly not yet
> emerged into the light on it all.
>
> Hugh
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson