From: David Gibson <david@gibson.dropbear.id.au>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Ken Chen <kenneth.w.chen@intel.com>,
Bill Irwin <wli@holomorphy.com>, Adam Litke <agl@us.ibm.com>,
linux-mm@kvack.org
Subject: Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
Date: Wed, 25 Oct 2006 20:09:29 +1000
Message-ID: <20061025100929.GA11040@localhost.localdomain>
In-Reply-To: <Pine.LNX.4.64.0610250841250.8576@blonde.wat.veritas.com>

On Wed, Oct 25, 2006 at 09:39:10AM +0100, Hugh Dickins wrote:
> On Wed, 25 Oct 2006, David Gibson wrote:
> > On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> > > If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> > > area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate
> > > my preliminary i_size check before attempting to allocate the page (though
> > > this only fixes the most obvious case: more work will be needed here).
> > >
> > > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > > ---
> > >
> > > This is not a complete solution (what if hugetlb_no_page is actually
> > > racing with truncate_hugepages?), and there are several other accounting
> > > anomalies in here (private versus shared pages, hugetlbfs quota handling);
> > > but those all need more thought. It'll probably make sense to use i_mutex
> > > instead of hugetlb_instantiation_mutex, so locking out truncation
> > > and mmap.
> >
> > Ah, yes. I also encountered this one a few days ago - I found it in
> > the context of deserializing the hugepage fault path, which makes the
> > problem worse, and forgot to consider if there was also a problem in
> > the original case.
> >
> > In fact, there's a second problem with the current location of the
> > i_size check. As well as wrapping the reserved count, if there's a
> > fault on a truncated area and the hugepage pool is also empty, we can
> > get an OOM SIGKILL instead of the correct SIGBUS.
>
> That's exactly why I put in the preliminary i_size check originally,
> which you guys then decided was unnecessary.
We did? That was silly of us..
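For the record, the check itself is tiny. A sketch of its shape, from
memory of Hugh's patch rather than the exact diff, so treat the
details as approximate:

	/* Early in hugetlb_no_page(), before allocating anything or
	 * adding it to the page cache: a fault wholly beyond EOF
	 * should just SIGBUS, rather than wrapping resv_huge_pages
	 * or OOM-killing when the pool happens to be empty. */
	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
	if (idx >= size)
		goto out;	/* backout path returns VM_FAULT_SIGBUS */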
> But it wasn't worth
> arguing over which particular error manifested in that case.
> > I don't think things are quite as bad as you fear, though: I believe
> > the page lock protects us against racing concurrent truncations (this
> > is one reason we have find_lock_page() here, rather than the
> > find_get_page() which appears in the analogous normal page path).
>
> The page lock protects once you've got it. But when getting a new
> page, there's a sequence of operations before getting the page lock,
> where a racing truncate may occur, and then we may add a page to cache
> beyond the i_size. And not remove it until the file is retruncated
> or deleted, or the mount unmounted: not the worst of leaks, but not
> what we want either.
Ah. Yes. Indeed.
Possibly we need to duplicate the truncate_count logic from the normal
page path.
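The pattern I mean, from do_no_page(), transplanted here roughly (the
alloc helper name is made up, and error handling is elided):

	struct page *page;
	unsigned int sequence;

retry:
	/* Sample the truncate generation before the lookup; the
	 * smp_rmb() pairs with the writer in the truncate path. */
	sequence = mapping->truncate_count;
	smp_rmb();

	page = find_lock_page(mapping, idx);
	if (!page)
		page = alloc_and_add_hugepage(mapping, idx);

	/* A truncate may have run between the sample and taking the
	 * page lock: if so, back out rather than instantiating a pte
	 * on a page that may now lie beyond i_size. */
	if (unlikely(sequence != mapping->truncate_count)) {
		unlock_page(page);
		put_page(page);
		goto retry;
	}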
> > I suggest the slightly revised patch below, which doesn't duplicate
> > the i_size test, and cleans up the backout path (removing a
> > no-longer-useful goto label) in the process.
>
> I don't see the advantage of that version: I'd rather stick with my
> clearly-not-worse two-liner for 2.6.19, and work on fixing it all up
> properly later.
Yeah, I guess so.
> I was only trying to fix the prio_tree issue (which had led me to
> mislead Ken on his prio_tree use - thankfully he ignored me and stuck
> with what worked): but each time I tested something I found something
> else wrong. Right now I must switch away, so I'm posting the obvious bits.
>
> To expand a little on the other problems here: the hugetlb_get_quota
> and hugetlb_put_quota calls are suspect but hard to get right, and
> the resv_huge_pages-- in alloc_huge_page is a problem too, I think.
> The handling of cached versus private pages seems confused (private
> pages should not be counted out of the superblock quota).
>
> It would probably straighten out if hugetlb_no_page just dealt
> with page cache pages (no VM_SHARED versus not path), and left the
> private pages to hugetlb_cow; but that would be wasteful.
In some cases exceedingly wasteful. I don't think we can do this.
> And almost(?) all the backtracking could be taken out if i_mutex
> were held; hugetlbfs_file_mmap is already taking i_mutex within
> mmap_sem (contrary to usual mm lock ordering, but probably okay
> since hugetlbfs has no read/write, though lockdep may need teaching).
> Though serializing these faults at all is regrettable.
Um, yes. Especially since I was in the middle of attempting to
de-serialize it. Christoph Lameter has userspace code that does
hugepage initialization (mostly clearing) in parallel, which obviously
won't work with the serialization. I have a tentative patch to address
it, which replaces the hugetlb_instantiation_mutex with a table of
mutexes, hashed on address_space and offset (or struct mm and address
for MAP_PRIVATE). Originally I tried simply removing the mutex and
retrying the fault whenever we got an OOM but detected a race. After
several variants on two or three basic approaches, each of which
turned out to be less race-free than I first thought, I gave up and
went with the hashed mutexes. Either way, though, there will still be
i_size issues to sort out.
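A minimal sketch of what I have in mind, with illustrative names (the
patch isn't final, so don't hold me to these):

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define HUGETLB_FAULT_MUTEX_SHIFT	8
#define HUGETLB_FAULT_MUTEX_COUNT	(1UL << HUGETLB_FAULT_MUTEX_SHIFT)

/* One mutex per bucket, each mutex_init()ed at boot; a fault takes
 * only the bucket covering the page it is instantiating, so unrelated
 * faults no longer serialize against each other. */
static struct mutex hugetlb_fault_mutex_table[HUGETLB_FAULT_MUTEX_COUNT];

static unsigned long hugetlb_fault_mutex_hash(struct address_space *mapping,
					      pgoff_t idx,
					      struct mm_struct *mm,
					      unsigned long address)
{
	unsigned long key;

	if (mapping)	/* shared: key on the file and offset */
		key = (unsigned long)mapping + idx;
	else		/* MAP_PRIVATE: key on the mm and address */
		key = (unsigned long)mm + (address >> HPAGE_SHIFT);

	return hash_long(key, HUGETLB_FAULT_MUTEX_SHIFT);
}

The fault path then does mutex_lock() on the hashed entry where it
currently takes hugetlb_instantiation_mutex. Two faults can still
collide on a bucket, but only same-page faults are guaranteed to,
which is the serialization we actually need.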
I'm also not sure that i_mutex will work in any case. I'm pretty sure
I looked for a suitable existing lock before I created
hugetlb_instantiation_mutex. I'm not at all sure, but I have a nasty
suspicion there's some not-immediately-obvious lock-ordering
constraint that means we can't take i_mutex across the fault.
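The generic shape of the constraint I'm thinking of, purely as
illustration (Hugh's point that hugetlbfs has no read/write may well
make it moot):

/*
 *   sys_write() path:                  fault path under discussion:
 *     mutex_lock(&inode->i_mutex)        down_read(&mm->mmap_sem)
 *     copy_from_user() faults:           (instantiate the page)
 *       down_read(&mm->mmap_sem)         mutex_lock(&inode->i_mutex)
 *
 * i_mutex -> mmap_sem in one task, mmap_sem -> i_mutex in another:
 * a classic AB-BA deadlock. If there is truly no read/write path on
 * hugetlbfs the left-hand column can't happen, but lockdep would
 * still need the nesting annotated.
 */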
> Things may not be as bad as I fear, but I've certainly not yet
> emerged into the light on it all.
>
> Hugh
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson