From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: linux-mm <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@redhat.com>
Subject: Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment
Date: Mon, 18 Aug 2008 16:56:06 -0400 [thread overview]
Message-ID: <1219092966.6232.47.camel@lts-notebook> (raw)
In-Reply-To: <20080818164449.60CE.KOSAKI.MOTOHIRO@jp.fujitsu.com>
On Mon, 2008-08-18 at 18:23 +0900, KOSAKI Motohiro wrote:
> Hi Lee-san,
>
> Sorry for late responce.
>
> I understand your intention.
> I agreed almost part of your explain, but I have few disagree points.
>
>
> > I'd like to drop this patch and keep the locked_vm adjustment. It
> > should still handle the error conditions, as a negative 'adjustment'
> > from mlock_vma_pages_range() is handled correctly as an error code.
> >
> > Now, why keep the adjustment?
> >
> > First, I agree that locked_vm tracks amount of locked Virtual Memory, as
> > the name indicates. IMO, this is not a useful metric. I think what we
> > want to track IS the amount of physical memory locked down by the tasks.
>
> hmmm..
> Sorry, disagreed.
> Various application assume current behavior.
>
> So, I like adding to "locked physical memory" metrics, not replace.
>
>
> > However, this is difficult. Consider that today, if one has called,
> > say, mlockall(MCL_FUTURE) and then you mmap() some large segment into
> > your address space more than once, you'll get charged for the locked_vm
> > each time you mmap() the same segment. One can easily exceed the
> > rlimit, even when the actual amount of locked pages is less than the
> > limit. But, as I say, this is difficult to fix efficiently and I don't
> > want to try to do that in the context of these patches.
>
> Agreed.
> This is mmap()'s bug.
>
>
> > Now, taking the view that locked_vm counts VM_LOCKED vmas, note that we
> > do not set VM_LOCKED for those vmas where it does not make sense to
> > mlock the pages with SetPageMlocked(). That is, the "special" vmas,
> > that have the following flags: VM_IO, VM_PFNMAP, VM_RESERVED,
> > VM_DONTEXPAND, and VM_HUGETLB -- all of the vmas that we filter out in
> > mlock_fixup().
> >
> > Now, for vmas with VM_IO or VM_PFNMAP, we don't even attempt to
> > "make_pages_present()" because get_user_pages() will error out
> > immediately. For the other types, we do make_pages_present() to avoid
> > future minor faults, but we don't set PageMlocked(). Because we don't
> > want to have to deal with these vmas during munmap() [including exit
> > processing] nor duplicate the vma filtering in that path, we don't set
> > the VM_LOCKED flags for these vmas. Since the vma is not VM_LOCKED, we
> > don't want to count the memory as locked_vm.
> >
> > For mlock[all](), we do the locked_vm accounting in mlock_fixup(). We
> > just don't count these vmas. See the comments "don't set VM_LOCKED,
> > don't count" in the code. However, mmap() does the locked_vm accounting
> > before calling mlock_vma_pages_range() [which replaces calls to
> > make_pages_present()]. To make mmap(MAP_LOCKED) behavior consistent
> > with mlock[all]() behavior, we return a "locked_vm" adjustment from
> > mlock_vma_pages_range(). For the filtered, special vmas, this is always
> > the size of the vma which mmap() has already added to locked_vm().
>
> Agreed.
> That is definitly bug. (and this isn't introduced by us)
>
> but current implementation is slightly bad.
>
> mlock_vma_pages_range()'s prototype is ..
>
> int mlock_vma_pages_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
>
> ==> return type is int.
>
> but sys_mlock()'s prototype is ..
>
> asmlinkage long sys_mlock(unsigned long start, size_t len)
>
> ==> len argument's type is size_t.
>
> So, this adjustment code can overflow easily.
>
> maybe, mlock_vma_pages_range()'s type should be changed as bellow
>
> int mlock_vma_pages_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> unsigned long *locked_pages)
>
>
> but, I think this should do as another patch
> because its bug isn't introduced by us, then it is another change.
>
>
> So, I like following order changing.
>
> 1. remove current adjustment code from split-lru patch series
> 2. re-introduce it as another patch. (of cource, overflow problem should be fixed)
>
>
> > So, I'd like to keep the adjustment returned by mlock_vma_pages_range().
> > However, as you note, the internal function __mlock_vma_pages_range()
> > always returns non-positive values [in earlier versions of this series
> > it could return a positive, non-zero value but I decided that was not
> > appropriate]. It can, however, return an error code with your changes.
> > If we keep the locked_vm adjustment, as I propose here, then we need to
> > pass any return value from __mlock_vma_pages_range() via the "nr_pages"
> > variable which can contain an error code [< 0] as well as a locked_vm
> > adjustment [>= 0].
> >
> > IMO, we still have some issues in locked_vm accounting that needs to be
> > addressed. E.g., when you attached a hugetlb shm segment, it is
> > automatically counted as locked_shm [tracked independently of locked_vm
> > in the user struct instead of mm -- go figure!]. However, when we
> > mmap() a hugetlbfs file, we don't automatically add this to locked_vm.
> > Maybe we should, but not as part of this patch series, I think.
>
> Agreed.
>
> > Does this make sense?
> >
> > Maybe I need to explain the rationale better in the unevictable_lru
> > documentation. I do discuss the behavior [locked_vm adjustment], but
> > maybe not enough rationale.
>
>
> Could you tell me your feeling about my concern?
OK, here's what I propose:
I'll regenerate the patches against the current mmotm, such that they
apply after the related patches. After the patch
"/mmap-handle-mlocked-pages-during-map-remap-unmap.patch", we'll have
the following three patches:
1) back out the locked_vm changes from this patch, so that we can apply
as a separate patch. This #1 patch can [should] be folded into the
"mmap-handle-mlocked-pages..." patch. This is essentially your patch
3/5, reworked to fit into this location.
2) a new patch to handle the locked_vm accounting during mmap() to match
mlock() behavior and the state of VM_LOCKED flags. Essentially, this
will restore the current behavior, with a separate patch description.
I'll also change the nr_pages type to long to alleviate your concern
about overflow. [But note nr_pages != nr_bytes.]
3) your patch, 4/5, to fix the error return of mlock_fixup() when next
vma changes during mlock() with downgraded semaphore.
Next, I think that the Posix mlock() error return issue is separate from
the unevictable mlocked page handling, altho' it touches the same code.
So, I propose two more patches to apply atop 27-rc*-mmotm, as follows:
1) revert the change to make_pages_present() in memory.c and mlock.c
from the mainline to address the mlock() posix error return.
make_pages_present() is not just for mlock(). Altho' all other callers
ignore the return value, we have a "better" place to put the error code
translation in mlock.c now.
2) patch mlock.c in __mlock_vma_pages_range() [both versions] to
translate the error return from get_user_pages/make_pages_present to the
proper Posix value. This patch will only affect mlock() error returns.
I have these patches cooking now, but will not have a change to post
them today. I'll send them out tomorrow.
Regards,
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-08-18 20:56 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-08-11 7:01 [RFC PATCH for -mm 0/5] mlock return value rework KOSAKI Motohiro
2008-08-11 7:04 ` [RFC PATCH for -mm 1/5] mlock() fix return values for mainline KOSAKI Motohiro
2008-08-12 20:39 ` Lee Schermerhorn
2008-08-13 8:03 ` KOSAKI Motohiro
2008-08-11 7:05 ` [RFC PATCH for -mm 2/5] related function comment fixes (optional) KOSAKI Motohiro
2008-08-12 19:02 ` Lee Schermerhorn
2008-08-13 8:37 ` KOSAKI Motohiro
2008-08-11 7:06 ` [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment KOSAKI Motohiro
2008-08-12 19:55 ` Lee Schermerhorn
2008-08-13 9:37 ` KOSAKI Motohiro
2008-08-15 13:54 ` Lee Schermerhorn
2008-08-18 9:23 ` KOSAKI Motohiro
2008-08-18 20:56 ` Lee Schermerhorn [this message]
2008-08-11 7:07 ` [RFC PATCH for -mm 4/5] fix mlock return value at munmap race KOSAKI Motohiro
2008-08-12 20:19 ` Lee Schermerhorn
2008-08-11 7:08 ` [RFC PATCH for -mm 5/5] fix mlock return value for mm KOSAKI Motohiro
2008-08-11 7:43 ` KOSAKI Motohiro
2008-08-12 20:30 ` Lee Schermerhorn
2008-08-13 8:36 ` KOSAKI Motohiro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1219092966.6232.47.camel@lts-notebook \
--to=lee.schermerhorn@hp.com \
--cc=akpm@linux-foundation.org \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox