Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: kosaki.motohiro@jp.fujitsu.com, linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment
Date: Mon, 18 Aug 2008 18:23:57 +0900	[thread overview]
Message-ID: <20080818164449.60CE.KOSAKI.MOTOHIRO@jp.fujitsu.com> (raw)
In-Reply-To: <1218808477.6373.49.camel@lts-notebook>

Hi Lee-san,

Sorry for late responce.

I understand your intention.
I agreed almost part of your explain, but I have few disagree points.


> I'd like to drop this patch and keep the locked_vm adjustment.  It
> should still handle the error conditions, as a negative 'adjustment'
> from mlock_vma_pages_range() is handled correctly as an error code.
> 
> Now, why keep the adjustment?
> 
> First, I agree that locked_vm tracks amount of locked Virtual Memory, as
> the name indicates.  IMO, this is not a useful metric.  I think what we
> want to track IS the amount of physical memory locked down by the tasks.

hmmm..
Sorry, disagreed.
Various application assume current behavior.

So, I like adding to "locked physical memory" metrics, not replace.


> However, this is difficult.  Consider that today, if one has called,
> say, mlockall(MCL_FUTURE) and then you mmap() some large segment into
> your address space more than once, you'll get charged for the locked_vm
> each time you mmap() the same segment.  One can easily exceed the
> rlimit, even when the actual amount of locked pages is less than the
> limit.  But, as I say, this is difficult to fix efficiently and I don't
> want to try to do that in the context of these patches.

Agreed.
This is mmap()'s bug.


> Now, taking the view that locked_vm counts VM_LOCKED vmas, note that we
> do not set VM_LOCKED for those vmas where it does not make sense to
> mlock the pages with SetPageMlocked().  That is, the "special" vmas,
> that have the following flags:  VM_IO, VM_PFNMAP, VM_RESERVED,
> VM_DONTEXPAND, and VM_HUGETLB -- all of the vmas that we filter out in
> mlock_fixup().  
>
> Now, for vmas with VM_IO or VM_PFNMAP, we don't even attempt to
> "make_pages_present()" because get_user_pages() will error out
> immediately.  For the other types, we do make_pages_present() to avoid
> future minor faults, but we don't set PageMlocked().  Because we don't
> want to have to deal with these vmas during munmap() [including exit
> processing] nor duplicate the vma filtering in that path, we don't set
> the VM_LOCKED flags for these vmas.  Since the vma is not VM_LOCKED, we
> don't want to count the memory as locked_vm.  
>
> For mlock[all](), we do the locked_vm accounting in mlock_fixup().  We
> just don't count these vmas.  See the comments "don't set VM_LOCKED,
> don't count" in the code.  However, mmap() does the locked_vm accounting
> before calling mlock_vma_pages_range() [which replaces calls to
> make_pages_present()].  To make mmap(MAP_LOCKED) behavior consistent
> with mlock[all]() behavior, we return a "locked_vm" adjustment from
> mlock_vma_pages_range().  For the filtered, special vmas, this is always
> the size of the vma which mmap() has already added to locked_vm().  

Agreed.
That is definitly bug. (and this isn't introduced by us)

but current implementation is slightly bad.

mlock_vma_pages_range()'s prototype is ..

	int mlock_vma_pages_range(struct vm_area_struct *vma,
	                        unsigned long start, unsigned long end)

    ==> return type is int.

but sys_mlock()'s prototype is ..

	asmlinkage long sys_mlock(unsigned long start, size_t len)

    ==> len argument's type is size_t.

So, this adjustment code can overflow easily.

maybe, mlock_vma_pages_range()'s type should be changed as bellow

	int mlock_vma_pages_range(struct vm_area_struct *vma,
	                        unsigned long start, unsigned long end,
	                        unsigned long *locked_pages)


but, I think this should do as another patch 
because its bug isn't introduced by us, then it is another change.


So, I like following order changing.

1. remove current adjustment code from split-lru patch series
2. re-introduce it as another patch. (of cource, overflow problem should be fixed)


> So, I'd like to keep the adjustment returned by mlock_vma_pages_range().
> However, as you note, the internal function __mlock_vma_pages_range()
> always returns non-positive values [in earlier versions of this series
> it could return a positive, non-zero value but I decided that was not
> appropriate].  It can, however, return an error code with your changes.
> If we keep the locked_vm adjustment, as I propose here, then we need to
> pass any return value from __mlock_vma_pages_range() via the "nr_pages"
> variable which can contain an error code [< 0] as well as a locked_vm
> adjustment [>= 0].
>
> IMO, we still have some issues in locked_vm accounting that needs to be
> addressed.  E.g., when you attached a hugetlb shm segment, it is
> automatically counted as locked_shm [tracked independently of locked_vm
> in the user struct instead of mm -- go figure!].  However, when we
> mmap() a hugetlbfs file, we don't automatically add this to locked_vm.
> Maybe we should, but not as part of this patch series, I think.

Agreed.

> Does this make sense?
> 
> Maybe I need to explain the rationale better in the unevictable_lru
> documentation.  I do discuss the behavior [locked_vm adjustment], but
> maybe not enough rationale.


Could you tell me your feeling about my concern?




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-08-18  9:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-08-11  7:01 [RFC PATCH for -mm 0/5] mlock return value rework KOSAKI Motohiro
2008-08-11  7:04 ` [RFC PATCH for -mm 1/5] mlock() fix return values for mainline KOSAKI Motohiro
2008-08-12 20:39   ` Lee Schermerhorn
2008-08-13  8:03     ` KOSAKI Motohiro
2008-08-11  7:05 ` [RFC PATCH for -mm 2/5] related function comment fixes (optional) KOSAKI Motohiro
2008-08-12 19:02   ` Lee Schermerhorn
2008-08-13  8:37     ` KOSAKI Motohiro
2008-08-11  7:06 ` [RFC PATCH for -mm 3/5] kill unnecessary locked_vm adjustment KOSAKI Motohiro
2008-08-12 19:55   ` Lee Schermerhorn
2008-08-13  9:37     ` KOSAKI Motohiro
2008-08-15 13:54       ` Lee Schermerhorn
2008-08-18  9:23         ` KOSAKI Motohiro [this message]
2008-08-18 20:56           ` Lee Schermerhorn
2008-08-11  7:07 ` [RFC PATCH for -mm 4/5] fix mlock return value at munmap race KOSAKI Motohiro
2008-08-12 20:19   ` Lee Schermerhorn
2008-08-11  7:08 ` [RFC PATCH for -mm 5/5] fix mlock return value for mm KOSAKI Motohiro
2008-08-11  7:43   ` KOSAKI Motohiro
2008-08-12 20:30     ` Lee Schermerhorn
2008-08-13  8:36       ` KOSAKI Motohiro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080818164449.60CE.KOSAKI.MOTOHIRO@jp.fujitsu.com \
    --to=kosaki.motohiro@jp.fujitsu.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox