From: David Rientjes <rientjes@google.com>
To: "Ricardo M. Correia" <ricardo.correia@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Brian Behlendorf <behlendorf1@llnl.gov>,
Andreas Dilger <andreas.dilger@oracle.com>
Subject: Re: Propagating GFP_NOFS inside __vmalloc()
Date: Mon, 15 Nov 2010 14:50:29 -0800 (PST) [thread overview]
Message-ID: <alpine.DEB.2.00.1011151426360.20468@chino.kir.corp.google.com> (raw)
In-Reply-To: <1289859596.13446.151.camel@oralap>
On Mon, 15 Nov 2010, Ricardo M. Correia wrote:
> > This proposal essentially defines an entirely new method for passing gfp
> > flags to the page allocator when it isn't strictly needed.
>
> I don't see it as a way of passing gfp flags to the page allocator, but
> rather as a way of restricting which gfp flags can be used (on a
> per-thread basis). I agree it's not strictly needed.
>
It restricts the flags that can be used in the current context, but that's
the whole purpose of gfp arguments such as __GFP_FS, __GFP_IO, and
__GFP_WAIT that control the reclaim behavior of the page allocator in the
first place: we want to use all strategies at our disposal that are
allowed in the current context to allocate the memory. Using this
interface to restrict only to __GFP_WAIT, for example, is the same as
passing __GFP_WAIT.
> > The first option really addresses the bug that you're running into and can
> > be addressed in a relatively simple way by redefining current users of
> > pmd_alloc_one(), for instance, as a form of a new lower-level
> > __pmd_alloc_one():
> >
> > static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
> > unsigned long addr, gfp_t flags)
> > {
> > return (pmd_t *)get_zeroed_page(flags);
> > }
> >
> > static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
> > {
> > return __pmd_alloc_one(GFP_KERNEL|__GFP_REPEAT);
> > }
> >
> > and then using __pmd_alloc_one() in the vmalloc path with the passed mask
> > rather than pmd_alloc_one().
>
> Ok, this sounds OK in theory (I'd have to change all these
> implementations for all architectures, but oh well..). But I have a
> question about your proposal.. the call chain seems to go:
>
Yes, all architectures would need to be changed because they all
unconditionally do GFP_KERNEL allocations which is buggy in the vmalloc
case, so the breadth of the change is certainly warranted.
> vmap_pmd_range()
> pmd_alloc()
> __pmd_alloc()
> pmd_alloc_one()
>
> So you want to change this so that vmap_pmd_range() calls your new
> __pmd_alloc_one() instead of pmd_alloc_one().
>
Yes, with the gfp flags that were passed to __vmalloc() to restrict the
lower-level allocations that you cited in your original email
(pmd_alloc_one() and pte_alloc_one_kernel()).
> But this means that pmd_alloc() and __pmd_alloc() would need to accept
> the flag as well, right?
>
> Probably we should define an alias for pmd_alloc() so that we don't have
> to change all its callers to pass a flag. But __pmd_alloc() is already
> taken, so what would you suggest we call it? _pmd_alloc()? :-)
>
Yeah, that's why I said it's slightly more intrusive than the single
example I gave, you need to modify the callchain to pass the flags in
other functions as well. Instead of extending the __*() functions with
more underscores like other places in the kernel (see mm/slab.c, for
instance), I'd suggest just appending _gfp() to their name so
__pmd_alloc() uses a new __pmd_alloc_gfp().
> Indeed... I don't think we can do any better in Lustre either. The
> preallocation also doesn't guarantee anything for us, because the amount
> of memory that we may need to allocate can be huge in the worst case,
> and it would be prohibitive to preallocate it.
>
I think it really depends on how much memory you plan to allocate without
allowing direct reclaim (and that also prevents the oom killer from being
used as well). The ntfs use, for instance, falls back to vmalloc if the
size is greater than PAGE_SIZE to avoid high-order allocations from
failing due to fragmentation and memory compaction can't be used for
GFP_NOFS either. The best the VM can do is allocate the order-0 pages
with GFP_NOFS which has a much higher liklihood that it won't succeed, so
it's really to gfs2, ntfs, and ceph's detriment that it doesn't have a
workaround.
> For our case, I'd think it's better to either handle failure or somehow
> retry until the allocation succeeds (if we know for sure that it will,
> eventually).
>
If your use-case is going to block until this memory is available, there's
a serious problem that you'll need to address because nothing is going to
guarantee that memory will be freed unless something else is trying to
allocate memory and pages get written back or something gets killed as a
result. Strictly relying on that behavior is concerning, but it's not
something that can be fixed in the VM.
> > but the liklihood of those allocations failing is much
> > harder than the typical vmalloc() that tries really hard with __GFP_REPEAT
> > to allocate the memory required.
>
> Not sure what do you mean by this.. I don't see a typical vmalloc()
> using __GFP_REPEAT anywhere (apart from functions such as
> pmd_alloc_one(), which in the code above you suggested to keep passing
> __GFP_REPEAT).. am I missing something?
>
__GFP_REPEAT will retry the allocation indefinitely until the needed
amount of memory is reclaimed without considering the order of the
allocation; all orders of interest in your case are order-0, so it will
loop indefinitely until a single page is reclaimed which won't happen with
GFP_NOFS. Thus, passing the flag is the equivalent of asking the
allocator to loop forever until memory is available rather than failing
and returning to your error handling.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-11-15 22:50 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-10 20:42 Ricardo M. Correia
2010-11-10 21:35 ` Ricardo M. Correia
2010-11-10 22:10 ` Dave Chinner
2010-11-11 20:06 ` Andrew Morton
2010-11-11 22:02 ` Ricardo M. Correia
2010-11-11 22:25 ` Andrew Morton
2010-11-11 22:45 ` Ricardo M. Correia
2010-11-11 23:19 ` Ricardo M. Correia
2010-11-11 23:27 ` Andrew Morton
2010-11-11 23:29 ` Ricardo M. Correia
2010-11-15 17:01 ` Ricardo M. Correia
2010-11-15 21:28 ` David Rientjes
2010-11-15 22:19 ` Ricardo M. Correia
2010-11-15 22:50 ` David Rientjes [this message]
2010-11-15 23:30 ` Ricardo M. Correia
2010-11-15 23:55 ` David Rientjes
2010-11-16 22:11 ` Andrew Morton
2010-11-17 7:18 ` Andreas Dilger
2010-11-17 7:24 ` Andrew Morton
2010-11-17 7:37 ` David Rientjes
2010-11-17 9:04 ` Christoph Hellwig
2010-11-17 21:24 ` David Rientjes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.DEB.2.00.1011151426360.20468@chino.kir.corp.google.com \
--to=rientjes@google.com \
--cc=akpm@linux-foundation.org \
--cc=andreas.dilger@oracle.com \
--cc=behlendorf1@llnl.gov \
--cc=linux-mm@kvack.org \
--cc=ricardo.correia@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox