* [PATCH] mm: make munlock fast when mlock is canceled by sigkill
@ 2009-08-22 16:54 Hiroaki Wakabayashi
  2009-08-24  1:44 ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Hiroaki Wakabayashi @ 2009-08-22 16:54 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, linux-mm, Paul Menage, Ying Han, KOSAKI Motohiro,
    Pekka Enberg, Lee Schermerhorn

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-22 16:54 [PATCH] mm: make munlock fast when mlock is canceled by sigkill Hiroaki Wakabayashi
@ 2009-08-24  1:44 ` Minchan Kim
  2009-08-24  1:51   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2009-08-24  1:44 UTC (permalink / raw)
To: Hiroaki Wakabayashi, KAMEZAWA Hiroyuki
Cc: Andrew Morton, LKML, linux-mm, Paul Menage, Ying Han,
    KOSAKI Motohiro, Pekka Enberg, Lee Schermerhorn

On Sun, Aug 23, 2009 at 1:54 AM, Hiroaki Wakabayashi <primulaelatior@gmail.com> wrote:
> From 27b2fde0222c59049026e7d0bdc4a2a68d0720f5 Mon Sep 17 00:00:00 2001
> From: Hiroaki Wakabayashi <primulaelatior@gmail.com>
> Date: Sat, 22 Aug 2009 19:14:53 +0900
> Subject: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
>
> This patch makes commit 4779280d1e (mm: make get_user_pages()
> interruptible) complete.
>
> Originally, munlock() assumed that all pages in the vma are pinned.
> Since that commit, mlock() can be interrupted by SIGKILL etc., so some
> of the pages may not be pinned. On SIGKILL, munlock() is called in the
> exit() path to unlock the pages that were pinned in the vma.
>
> But there, get_user_pages(write) is used for munlock(). So pages are
> allocated via page fault for an exiting process! This is a problem
> when canceling a big mlock().
> This patch avoids allocating new pages at munlock() time:
>
>   mlock( big area )
>        <===== sigkill
>   do_exit()
>     ->mmput()
>       -> do_munlock()
>         -> get_user_pages()
>              <allocates *never used* memory>
>       -> ..... frees the allocated memory
>
> * Test program
> % cat run.sh
> #!/bin/sh
>
> ./mlock_test 2000000000 &
> sleep 2
> kill -9 $!
> wait
>
> % cat mlock_test.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <time.h>
> #include <unistd.h>
> #include <sys/time.h>
>
> int main(int argc, char **argv)
> {
>         size_t length = 50 * 1024 * 1024;
>         void *addr;
>         time_t timer;
>
>         if (argc >= 2)
>                 length = strtoul(argv[1], NULL, 10);
>         printf("PID = %d\n", getpid());
>         addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (addr == MAP_FAILED) {
>                 fprintf(stderr, "mmap failed: %s, length=%lu\n",
>                         strerror(errno), length);
>                 exit(EXIT_FAILURE);
>         }
>         printf("try mlock length=%lu\n", length);
>         timer = time(NULL);
>         if (mlock(addr, length) < 0) {
>                 fprintf(stderr, "mlock failed: %s, time=%lu[sec]\n",
>                         strerror(errno), time(NULL) - timer);
>                 exit(EXIT_FAILURE);
>         }
>         printf("mlock succeeded, time=%lu[sec]\n\n", time(NULL) - timer);
>         printf("try munlock length=%lu\n", length);
>         timer = time(NULL);
>         if (munlock(addr, length) < 0) {
>                 fprintf(stderr, "munlock failed: %s, time=%lu[sec]\n",
>                         strerror(errno), time(NULL) - timer);
>                 exit(EXIT_FAILURE);
>         }
>         printf("munlock succeeded, time=%lu[sec]\n\n", time(NULL) - timer);
>         if (munmap(addr, length) < 0) {
>                 fprintf(stderr, "munmap failed: %s\n", strerror(errno));
>                 exit(EXIT_FAILURE);
>         }
>         return 0;
> }
>
> * Executed Result
> -- Original result
> % time ./run.sh
>
> PID = 2678
> try mlock length=2000000000
> ./run.sh: line 6:  2678 Killed                  ./mlock_test 2000000000
> ./run.sh  0.00s user 2.59s system 13% cpu 18.781 total
> %
>
> -- After applying this patch
> % time ./run.sh
>
> PID = 2512
> try mlock length=2000000000
> ./run.sh: line 6:  2512 Killed                  ./mlock_test 2000000000
> ./run.sh  0.00s user 1.15s system 45% cpu 2.507 total
> %
>
> Signed-off-by: Hiroaki Wakabayashi <primulaelatior@gmail.com>
> ---
>  mm/internal.h |    1 +
>  mm/memory.c   |    9 +++++++--
>  mm/mlock.c    |   35 +++++++++++++++++++----------------
>  3 files changed, 27 insertions(+), 18 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index f290c4d..4ab5b24 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -254,6 +254,7 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>  #define GUP_FLAGS_FORCE                  0x2
>  #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
>  #define GUP_FLAGS_IGNORE_SIGKILL         0x8
> +#define GUP_FLAGS_ALLOW_NULL             0x10
>

I am worried about adding a new flag whenever we need one,
but this case makes sense to me.
In addition, I guess the ZERO page can also use this flag.

Kame, what do you think about it?

>  int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                      unsigned long start, int len, int flags,
> diff --git a/mm/memory.c b/mm/memory.c
> index aede2ce..b41fbf9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1217,6 +1217,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>         int force = !!(flags & GUP_FLAGS_FORCE);
>         int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
>         int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
> +       int allow_null = !!(flags & GUP_FLAGS_ALLOW_NULL);
>
>         if (nr_pages <= 0)
>                 return 0;
> @@ -1312,6 +1313,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                         while (!(page = follow_page(vma, start, foll_flags))) {
>                                 int ret;
>
> +                               if (allow_null)
> +                                       break;
>                                 ret = handle_mm_fault(mm, vma, start,
>                                         (foll_flags & FOLL_WRITE) ?
>                                         FAULT_FLAG_WRITE : 0);
> @@ -1351,8 +1354,10 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                         if (pages) {
>                                 pages[i] = page;
>
> -                               flush_anon_page(vma, page, start);
> -                               flush_dcache_page(page);
> +                               if (page) {
> +                                       flush_anon_page(vma, page, start);
> +                                       flush_dcache_page(page);
> +                               }
>                         }
>                         if (vmas)
>                                 vmas[i] = vma;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 45eb650..0f5827b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -178,9 +178,10 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
>          */
>         if (!mlock)
>                 gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS |
> -                            GUP_FLAGS_IGNORE_SIGKILL;
> +                            GUP_FLAGS_IGNORE_SIGKILL |
> +                            GUP_FLAGS_ALLOW_NULL;
>
> -       if (vma->vm_flags & VM_WRITE)
> +       if (mlock && (vma->vm_flags & VM_WRITE))
>                 gup_flags |= GUP_FLAGS_WRITE;
>
>         while (nr_pages > 0) {
> @@ -220,21 +221,23 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
>                 for (i = 0; i < ret; i++) {
>                         struct page *page = pages[i];
>
> -                       lock_page(page);
> -                       /*
> -                        * Because we lock page here and migration is blocked
> -                        * by the elevated reference, we need only check for
> -                        * page truncation (file-cache only).
> -                        */
> -                       if (page->mapping) {
> -                               if (mlock)
> -                                       mlock_vma_page(page);
> -                               else
> -                                       munlock_vma_page(page);
> +                       if (page) {
> +                               lock_page(page);
> +                               /*
> +                                * Because we lock page here and migration is
> +                                * blocked by the elevated reference, we need
> +                                * only check for page truncation
> +                                * (file-cache only).
> +                                */
> +                               if (page->mapping) {
> +                                       if (mlock)
> +                                               mlock_vma_page(page);
> +                                       else
> +                                               munlock_vma_page(page);
> +                               }
> +                               unlock_page(page);
> +                               put_page(page); /* ref from get_user_pages() */
>                         }
> -                       unlock_page(page);
> -                       put_page(page);         /* ref from get_user_pages() */
> -
>                         /*
>                          * here we assume that get_user_pages() has given us
>                          * a list of virtually contiguous pages.
> --
> 1.5.6.5
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Kind regards,
Minchan Kim
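The core of the patch above can be sketched as a toy userspace model (illustrative only; `toy_gup`, the `present[]` array, and the fault counter are hypothetical stand-ins, not kernel code): with the new flag, a slot whose page was never faulted in is left as a NULL hole instead of triggering a simulated `handle_mm_fault()`.

```c
#include <assert.h>

#define GUP_FLAGS_ALLOW_NULL 0x10

/*
 * Toy model of the follow_page()/handle_mm_fault() loop in
 * __get_user_pages().  present[i] != 0 means the PTE already has a
 * page behind it.  Without GUP_FLAGS_ALLOW_NULL a missing page is
 * "faulted in" (counted in *faults); with the flag it stays a 0
 * (NULL) slot and costs nothing.
 */
static int toy_gup(const int *present, int n, int flags,
                   int *pages, int *faults)
{
        int i;

        *faults = 0;
        for (i = 0; i < n; i++) {
                if (present[i]) {
                        pages[i] = 1;   /* page was already there */
                } else if (flags & GUP_FLAGS_ALLOW_NULL) {
                        pages[i] = 0;   /* NULL hole, no allocation */
                } else {
                        (*faults)++;    /* handle_mm_fault() path */
                        pages[i] = 1;   /* newly allocated page */
                }
        }
        return n;                       /* slots filled, holes included */
}
```

For a vma where mlock() was killed halfway, `present[]` would be mostly zero, so the flag turns a long run of pointless allocations at munlock() time into none, which is exactly what the timing numbers above show.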
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-24  1:44 ` Minchan Kim
@ 2009-08-24  1:51   ` KAMEZAWA Hiroyuki
  2009-08-24  2:23     ` Minchan Kim
  2009-08-24  4:13     ` KOSAKI Motohiro
  0 siblings, 2 replies; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-24  1:51 UTC (permalink / raw)
To: Minchan Kim
Cc: Hiroaki Wakabayashi, Andrew Morton, LKML, linux-mm, Paul Menage,
    Ying Han, KOSAKI Motohiro, Pekka Enberg, Lee Schermerhorn

On Mon, 24 Aug 2009 10:44:41 +0900 Minchan Kim <minchan.kim@gmail.com> wrote:
> On Sun, Aug 23, 2009 at 1:54 AM, Hiroaki Wakabayashi<primulaelatior@gmail.com> wrote:
> > [full patch quoted in the previous message snipped]
> > +#define GUP_FLAGS_ALLOW_NULL             0x10
>
> I am worried about adding a new flag whenever we need one,
> but this case makes sense to me.
> In addition, I guess the ZERO page can also use this flag.
>
> Kame, what do you think about it?
>
I do welcome this!
Then, I don't have to take care of mlock/munlock in the ZERO_PAGE patch.

And without this patch, munlock() does copy-on-write just for unpinning
memory. So, this patch points in the right direction, I think.

One concern is the flag name; ALLOW_NULL does not sound very good.

GUP_FLAGS_NOFAULT?

I wonder if we can remove the FOLL_ANON hack for core dumps with this
flag, too.

Thanks,
-Kame

[rest of quoted patch snipped]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-24  1:51 ` KAMEZAWA Hiroyuki
@ 2009-08-24  2:23   ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2009-08-24  2:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Hiroaki Wakabayashi, Andrew Morton, LKML, linux-mm, Paul Menage,
    Ying Han, KOSAKI Motohiro, Pekka Enberg, Lee Schermerhorn

On Mon, Aug 24, 2009 at 10:51 AM, KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [quoted patch snipped]
>
> I do welcome this!
> Then, I don't have to take care of mlock/munlock in the ZERO_PAGE patch.
>
> And without this patch, munlock() does copy-on-write just for unpinning
> memory. So, this patch points in the right direction, I think.
>
> One concern is the flag name; ALLOW_NULL does not sound very good.
>
> GUP_FLAGS_NOFAULT?
>
> I wonder if we can remove the FOLL_ANON hack for core dumps with this
> flag, too.

That's a good point. It would also remove a little cache footprint and
some unnecessary calls (the flush_xxx_page calls in GUP).

--
Kind regards,
Minchan Kim
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-24  1:51 ` KAMEZAWA Hiroyuki
@ 2009-08-24  4:13   ` KOSAKI Motohiro
  2009-08-25  4:46     ` Hiroaki Wakabayashi
  1 sibling, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-08-24  4:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Minchan Kim, Hiroaki Wakabayashi, Andrew Morton, LKML, linux-mm,
    Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

> This patch makes commit 4779280d1e (mm: make get_user_pages()
> interruptible) complete.

Yes. Commit 4779280d1e (mm: make get_user_pages() interruptible) has
never worked as expected since it was born. IOW, it was totally broken.
This patch is definitely a good step forward.

> [quoted flag discussion snipped]
>
> One concern is the flag name; ALLOW_NULL does not sound very good.
>
> GUP_FLAGS_NOFAULT?

Yeah, GUP_FLAGS_NOFAULT is better.

Plus, this patch changes the meaning of the __get_user_pages() return
value. IOW, after this patch it can return the following:

  return value: 3
  pages[0]: hoge-page
  pages[1]: null
  pages[2]: fuga-page

but it could instead be

  return value: 2
  pages[0]: hoge-page
  pages[1]: fuga-page

no?
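KOSAKI's two return-value conventions can be contrasted in a small userspace sketch (hypothetical helpers, not the kernel API): under the sparse convention the slot index still encodes the virtual address, while compacting the NULL holes away loses that correspondence.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/*
 * Sparse convention: pages[i] may be NULL, but slot i always
 * corresponds to the address start + i * PAGE_SIZE, so the caller
 * can resume the next batch from start + ret * PAGE_SIZE.
 */
static unsigned long slot_addr(unsigned long start, int i)
{
        return start + (unsigned long)i * TOY_PAGE_SIZE;
}

/*
 * Dense convention: squeeze the NULL holes out.  Returns the new
 * count; afterwards the index-to-address correspondence is gone,
 * so the caller can no longer tell how far the scan advanced.
 */
static int compact_pages(void **pages, int n)
{
        int i, j = 0;

        for (i = 0; i < n; i++)
                if (pages[i] != NULL)
                        pages[j++] = pages[i];
        return j;
}
```

The trade-off this illustrates is the one debated in the following messages: the dense convention matches what existing get_user_pages() callers expect from the return value, but it discards exactly the positional information __mlock_vma_pages_range() needs to compute its next `start`.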
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-24  4:13 ` KOSAKI Motohiro
@ 2009-08-25  4:46   ` Hiroaki Wakabayashi
  2009-08-25  5:39     ` Minchan Kim
  2009-08-25  9:03     ` Hugh Dickins
  0 siblings, 2 replies; 12+ messages in thread
From: Hiroaki Wakabayashi @ 2009-08-25  4:46 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton, LKML, linux-mm,
    Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

Thank you for the reviews.

> > One concern is the flag name; ALLOW_NULL does not sound very good.
> >
> > GUP_FLAGS_NOFAULT?
>
> Yeah, GUP_FLAGS_NOFAULT is better.

Me too. I will change the flag name.

> Plus, this patch changes the meaning of the __get_user_pages() return
> value. [snipped]

I misunderstood the meaning of get_user_pages()'s return value.

When I tried to change __get_user_pages() that way, I hit a problem.
If the NULLs are removed from pages[], __mlock_vma_pages_range()
cannot know how far __get_user_pages() has read. So I would have to
get the virtual address of the page back from the vma and the page,
because __mlock_vma_pages_range() has to call __get_user_pages()
many times with a different 'start' argument each time.

I tried to use page_address_in_vma(), but it failed
(page_address_in_vma() returned -EFAULT).
I cannot find a way to solve this problem. Are there good ideas?
Please give me some ideas.

Thanks.
--
Hiroaki Wakabayashi
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  4:46 ` Hiroaki Wakabayashi
@ 2009-08-25  5:39   ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2009-08-25  5:39 UTC (permalink / raw)
To: Hiroaki Wakabayashi
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton,
    LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg,
    Lee Schermerhorn

On Tue, 25 Aug 2009 13:46:19 +0900 Hiroaki Wakabayashi <primulaelatior@gmail.com> wrote:
> [quoted discussion snipped]
>
> When I tried to change __get_user_pages(), I hit a problem.
> If the NULLs are removed from pages[], __mlock_vma_pages_range()
> cannot know how far __get_user_pages() has read.
> So I would have to get the virtual address of the page back from the
> vma and the page, because __mlock_vma_pages_range() has to call
> __get_user_pages() many times with a different 'start' argument.
>
> I tried to use page_address_in_vma(), but it failed
> (page_address_in_vma() returned -EFAULT).
> I cannot find a way to solve this problem.
> Are there good ideas?
> Please give me some ideas.

Could you satisfy your needs with this?

--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -217,6 +217,11 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,

 		lru_add_drain();	/* push cached pages to LRU */

+		/*
+		 * here we assume that get_user_pages() has given us
+		 * a list of virtually contiguous pages.
+		 */
+		addr += PAGE_SIZE * ret;	/* for next get_user_pages() */
 		for (i = 0; i < ret; i++) {
 			struct page *page = pages[i];

@@ -234,12 +239,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 			}
 			unlock_page(page);
 			put_page(page);	/* ref from get_user_pages() */
-
-			/*
-			 * here we assume that get_user_pages() has given us
-			 * a list of virtually contiguous pages.
-			 */
-			addr += PAGE_SIZE;	/* for next get_user_pages() */
 			nr_pages--;
 		}
 		ret = 0;

--
Kind regards,
Minchan Kim
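The effect of the suggestion above can be modeled in userspace (a sketch; `present[]` marking faulted-in slots and the batch size are illustrative, not kernel state): because `addr` now advances by the whole batch before the per-page loop, NULL holes in `pages[]` no longer desynchronize the next `get_user_pages()` start address.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/*
 * Toy model of the __mlock_vma_pages_range() loop with the change
 * applied: addr is advanced once per batch, not once per non-NULL
 * page, so it stays in sync even when some slots are NULL holes.
 * Returns the final addr; *unlocked counts pages actually munlocked.
 */
static unsigned long toy_munlock(const int *present, int nr_pages,
                                 unsigned long addr, int batch,
                                 int *unlocked)
{
        int done = 0;

        *unlocked = 0;
        while (nr_pages > 0) {
                int ret = nr_pages < batch ? nr_pages : batch;
                int i;

                /* for next get_user_pages(): whole batch, holes included */
                addr += TOY_PAGE_SIZE * (unsigned long)ret;
                for (i = 0; i < ret; i++)
                        if (present[done + i])  /* NULL slots are skipped */
                                (*unlocked)++;  /* munlock_vma_page() here */
                done += ret;
                nr_pages -= ret;
        }
        return addr;
}
```

This relies on the same assumption the moved comment states: the batch returned by get_user_pages() covers virtually contiguous pages, so `addr` ends at `start + nr_pages * PAGE_SIZE` no matter how many slots were NULL.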
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  4:46 ` Hiroaki Wakabayashi
  2009-08-25  5:39 ` Minchan Kim
@ 2009-08-25  9:03 ` Hugh Dickins
  2009-08-25  9:37   ` KAMEZAWA Hiroyuki
  ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Hugh Dickins @ 2009-08-25  9:03 UTC (permalink / raw)
  To: Hiroaki Wakabayashi
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton,
      LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3687 bytes --]

On Tue, 25 Aug 2009, Hiroaki Wakabayashi wrote:
> Thank you for the reviews.
>
> >>> > @@ -254,6 +254,7 @@ static inline void
> >>> > mminit_validate_memmodel_limits(unsigned long *start_pfn,
> >>> >  #define GUP_FLAGS_FORCE                  0x2
> >>> >  #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
> >>> >  #define GUP_FLAGS_IGNORE_SIGKILL         0x8
> >>> > +#define GUP_FLAGS_ALLOW_NULL             0x10
> >>> >
> >>>
> >>> I am worried about adding a new flag whenever we need one.

Indeed! See my comments below.

> >>> But I think this case makes sense to me.
> >>> In addition, I guess the ZERO page can also use this flag.
> >>>
> >>> Kame. What do you think about it?
> >>>
> >> I do welcome this!
> >> Then, I don't have to take care of mlock/munlock in the ZERO_PAGE patch.

I _think_ there's nothing to do for it (the page->mapping checks suit
the ZERO_PAGE); but I've not started testing my version, so may soon
be proved wrong.

> >>
> >> And without this patch, munlock() does copy-on-write just for unpinning memory.
> >> So, this patch shows the right direction, I think.
> >>
> >> One concern is the flag name; ALLOW_NULL does not sound very good.
> >>
> >>   GUP_FLAGS_NOFAULT ?
> >>
> >> I wonder if we can remove the FOLL_ANON hack for core-dump with this flag, too.

No, the considerations there are different (it can only point to a ZERO_PAGE
where faulting would anyway present a page of zeroes); it should be dealt
with by a coredump-specific flag, rather than sowing confusion elsewhere.
As above, I've done that but not yet tested it.

> > Yeah, GUP_FLAGS_NOFAULT is better.
>
> Me too.
> I will change this flag name.
>...
> When I tried to change __get_user_pages(), I hit a problem.
> If the NULLs are removed from pages[],
> __mlock_vma_pages_range() cannot know how far __get_user_pages() has read.
> So, I have to get the virtual address of the page from the vma and the page,
> because __mlock_vma_pages_range() has to call
> __get_user_pages() many times with a different `start' argument.
>
> I tried to use page_address_in_vma(), but it failed.
> (page_address_in_vma() returned -EFAULT)
> I cannot find a way to solve this problem.
> Are there good ideas?
> Please give me some ideas.

I agree that this munlock issue needs to be addressed: it's not just a
matter of speedup.  I hit it when testing what happens when mlock takes
you to OOM - which is currently a hanging disaster, because munlock'ing
in the exiting OOM-killed process gets stuck trying to fault in all
those pages that couldn't be locked in the first place.

I had intended to fix it by being more careful about splitting/merging
vmas, noting how far the mlock had got, and munlocking just up to there.
However, now that I've got in there, that looks wrong to me, given the
traditional behaviour that mlock does its best, but pretends success
to allow for later instantiation of the pages if necessary.

You ask for ideas.  My main idea is that so far we have added
GUP_FLAGS_IGNORE_VMA_PERMISSIONS (Kosaki-san, what was that about?
we already had the force flag), GUP_FLAGS_IGNORE_SIGKILL, and now
you propose GUP_FLAGS_NOFAULT, all for the sole use of munlock.

How about GUP_FLAGS_MUNLOCK, or more to the point, GUP_FLAGS_DONT_BE_GUP?
By which I mean, don't all these added flags suggest that almost
everything __get_user_pages() does is unsuited to the munlock case?
My advice (but I sure hate giving advice before I've tried it myself)
is to put __mlock_vma_pages_range() back to handling just the mlock
case, and do your own follow_page() loop in munlock_vma_pages_range().

Hugh
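[Editor's note: the follow_page() loop suggested above might look roughly like the sketch below. This is an untested, kernel-style sketch against the 2009 mm internals, not the patch that was eventually merged; follow_page(), munlock_vma_page() and FOLL_GET are real symbols of that era, but the exact body and checks here are assumptions.]

```c
static void munlock_vma_pages_range(struct vm_area_struct *vma,
				    unsigned long start, unsigned long end)
{
	unsigned long addr;

	lru_add_drain();
	vma->vm_flags &= ~VM_LOCKED;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		/*
		 * follow_page() only looks up already-present ptes;
		 * unlike get_user_pages() it never faults anything in,
		 * which is exactly what the munlock-after-SIGKILL path
		 * wants: skip the holes instead of populating them.
		 */
		struct page *page = follow_page(vma, addr, FOLL_GET);

		if (page) {
			lock_page(page);
			if (page->mapping)	/* guards against truncation races */
				munlock_vma_page(page);
			unlock_page(page);
			put_page(page);		/* ref taken by FOLL_GET */
		}
		cond_resched();
	}
}
```

Because nothing is faulted in, the loop's cost is bounded by the ptes actually populated, which is what makes munlock of a SIGKILL-interrupted mlock fast.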
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  9:03 ` Hugh Dickins
@ 2009-08-25  9:37   ` KAMEZAWA Hiroyuki
  2009-08-25 11:46     ` Minchan Kim
  2009-08-26  2:32   ` KOSAKI Motohiro
  2009-08-26 11:11   ` Hiroaki Wakabayashi
  2 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-25  9:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Hiroaki Wakabayashi, KOSAKI Motohiro, Minchan Kim, Andrew Morton,
      LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

On Tue, 25 Aug 2009 10:03:30 +0100 (BST)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:

> My advice (but I sure hate giving advice before I've tried it myself)
> is to put __mlock_vma_pages_range() back to handling just the mlock
> case, and do your own follow_page() loop in munlock_vma_pages_range().
>

I have no objections to making use of follow_page().

Thanks,
-Kame
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  9:37 ` KAMEZAWA Hiroyuki
@ 2009-08-25 11:46   ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2009-08-25 11:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Hiroaki Wakabayashi, KOSAKI Motohiro, Andrew Morton,
      LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

On Tue, Aug 25, 2009 at 6:37 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 25 Aug 2009 10:03:30 +0100 (BST)
> Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
>> My advice (but I sure hate giving advice before I've tried it myself)
>> is to put __mlock_vma_pages_range() back to handling just the mlock
>> case, and do your own follow_page() loop in munlock_vma_pages_range().
>>
>
> I have no objections to making use of follow_page().

Me, too. We don't need to add a new flag when there is a simple method
like this.

> Thanks,
> -Kame
>

--
Kind regards,
Minchan Kim
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  9:03 ` Hugh Dickins
  2009-08-25  9:37 ` KAMEZAWA Hiroyuki
@ 2009-08-26  2:32 ` KOSAKI Motohiro
  2009-08-26 11:11 ` Hiroaki Wakabayashi
  2 siblings, 0 replies; 12+ messages in thread
From: KOSAKI Motohiro @ 2009-08-26  2:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: kosaki.motohiro, Hiroaki Wakabayashi, KAMEZAWA Hiroyuki, Minchan Kim,
      Andrew Morton, LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg,
      Lee Schermerhorn

> > > Yeah, GUP_FLAGS_NOFAULT is better.
> >
> > Me too.
> > I will change this flag name.
> >...
> > When I tried to change __get_user_pages(), I hit a problem.
> > If the NULLs are removed from pages[],
> > __mlock_vma_pages_range() cannot know how far __get_user_pages() has read.
> > So, I have to get the virtual address of the page from the vma and the page,
> > because __mlock_vma_pages_range() has to call
> > __get_user_pages() many times with a different `start' argument.
> >
> > I tried to use page_address_in_vma(), but it failed.
> > (page_address_in_vma() returned -EFAULT)
> > I cannot find a way to solve this problem.
> > Are there good ideas?
> > Please give me some ideas.
>
> I agree that this munlock issue needs to be addressed: it's not just a
> matter of speedup.  I hit it when testing what happens when mlock takes
> you to OOM - which is currently a hanging disaster, because munlock'ing
> in the exiting OOM-killed process gets stuck trying to fault in all
> those pages that couldn't be locked in the first place.

I agree too.

> I had intended to fix it by being more careful about splitting/merging
> vmas, noting how far the mlock had got, and munlocking just up to there.
> However, now that I've got in there, that looks wrong to me, given the
> traditional behaviour that mlock does its best, but pretends success
> to allow for later instantiation of the pages if necessary.
>
> You ask for ideas.
> My main idea is that so far we have added
> GUP_FLAGS_IGNORE_VMA_PERMISSIONS (Kosaki-san, what was that about?
> we already had the force flag),

MAY_WRITE and MAY_READ might be turned off in some special cases, but
munlock should still clear the PG_mlocked bit; otherwise the page can
never be reclaimed. This problem was explained by Lee about a year ago.

However, using follow_page() solves this issue too, and then we will be
able to remove this ugly flag.

> GUP_FLAGS_IGNORE_SIGKILL, and now you propose
> GUP_FLAGS_NOFAULT, all for the sole use of munlock.
>
> How about GUP_FLAGS_MUNLOCK, or more to the point, GUP_FLAGS_DONT_BE_GUP?
> By which I mean, don't all these added flags suggest that almost
> everything __get_user_pages() does is unsuited to the munlock case?
>
> My advice (but I sure hate giving advice before I've tried it myself)
> is to put __mlock_vma_pages_range() back to handling just the mlock
> case, and do your own follow_page() loop in munlock_vma_pages_range().

Agreed. follow_page() is better.
* Re: [PATCH] mm: make munlock fast when mlock is canceled by sigkill
  2009-08-25  9:03 ` Hugh Dickins
  2009-08-25  9:37 ` KAMEZAWA Hiroyuki
  2009-08-26  2:32 ` KOSAKI Motohiro
@ 2009-08-26 11:11 ` Hiroaki Wakabayashi
  2 siblings, 0 replies; 12+ messages in thread
From: Hiroaki Wakabayashi @ 2009-08-26 11:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Minchan Kim, Andrew Morton,
      LKML, linux-mm, Paul Menage, Ying Han, Pekka Enberg, Lee Schermerhorn

Thank you for the ideas and advice!

2009/8/25 Hugh Dickins <hugh.dickins@tiscali.co.uk>:
> On Tue, 25 Aug 2009, Hiroaki Wakabayashi wrote:
>> Thank you for the reviews.
>>
>> >>> > @@ -254,6 +254,7 @@ static inline void
>> >>> > mminit_validate_memmodel_limits(unsigned long *start_pfn,
>> >>> >  #define GUP_FLAGS_FORCE                  0x2
>> >>> >  #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
>> >>> >  #define GUP_FLAGS_IGNORE_SIGKILL         0x8
>> >>> > +#define GUP_FLAGS_ALLOW_NULL             0x10
>> >>> >
>> >>>
>> >>> I am worried about adding a new flag whenever we need one.
>
> Indeed! See my comments below.
>
>> >>> But I think this case makes sense to me.
>> >>> In addition, I guess the ZERO page can also use this flag.
>> >>>
>> >>> Kame. What do you think about it?
>> >>>
>> >> I do welcome this!
>> >> Then, I don't have to take care of mlock/munlock in the ZERO_PAGE patch.
>
> I _think_ there's nothing to do for it (the page->mapping checks suit
> the ZERO_PAGE); but I've not started testing my version, so may soon
> be proved wrong.
>
>> >>
>> >> And without this patch, munlock() does copy-on-write just for unpinning memory.
>> >> So, this patch shows the right direction, I think.
>> >>
>> >> One concern is the flag name; ALLOW_NULL does not sound very good.
>> >>
>> >>   GUP_FLAGS_NOFAULT ?
>> >>
>> >> I wonder if we can remove the FOLL_ANON hack for core-dump with this flag, too.
>
> No, the considerations there are different (it can only point to a ZERO_PAGE
> where faulting would anyway present a page of zeroes); it should be dealt
> with by a coredump-specific flag, rather than sowing confusion elsewhere.
> As above, I've done that but not yet tested it.
>
>> >
>> > Yeah, GUP_FLAGS_NOFAULT is better.
>>
>> Me too.
>> I will change this flag name.
>>...
>> When I tried to change __get_user_pages(), I hit a problem.
>> If the NULLs are removed from pages[],
>> __mlock_vma_pages_range() cannot know how far __get_user_pages() has read.
>> So, I have to get the virtual address of the page from the vma and the page,
>> because __mlock_vma_pages_range() has to call
>> __get_user_pages() many times with a different `start' argument.
>>
>> I tried to use page_address_in_vma(), but it failed.
>> (page_address_in_vma() returned -EFAULT)
>> I cannot find a way to solve this problem.
>> Are there good ideas?
>> Please give me some ideas.
>
> I agree that this munlock issue needs to be addressed: it's not just a
> matter of speedup.  I hit it when testing what happens when mlock takes
> you to OOM - which is currently a hanging disaster, because munlock'ing
> in the exiting OOM-killed process gets stuck trying to fault in all
> those pages that couldn't be locked in the first place.

I'm sorry, this is too difficult for me to understand.
I will learn and consider it.

> I had intended to fix it by being more careful about splitting/merging
> vmas, noting how far the mlock had got, and munlocking just up to there.
> However, now that I've got in there, that looks wrong to me, given the
> traditional behaviour that mlock does its best, but pretends success
> to allow for later instantiation of the pages if necessary.
>
> You ask for ideas.  My main idea is that so far we have added
> GUP_FLAGS_IGNORE_VMA_PERMISSIONS (Kosaki-san, what was that about?
> we already had the force flag),
> GUP_FLAGS_IGNORE_SIGKILL, and now you propose
> GUP_FLAGS_NOFAULT, all for the sole use of munlock.
>
> How about GUP_FLAGS_MUNLOCK, or more to the point, GUP_FLAGS_DONT_BE_GUP?
> By which I mean, don't all these added flags suggest that almost
> everything __get_user_pages() does is unsuited to the munlock case?
>
> My advice (but I sure hate giving advice before I've tried it myself)
> is to put __mlock_vma_pages_range() back to handling just the mlock
> case, and do your own follow_page() loop in munlock_vma_pages_range().
>
> Hugh

Me, too. I agree that __get_user_pages() is unsuited to munlock.
I will try to make a follow_page() loop, and remove
GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL.

Thanks!
--
Hiroaki Wakabayashi
end of thread, other threads:[~2009-08-26 11:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-22 16:54 [PATCH] mm: make munlock fast when mlock is canceled by sigkill Hiroaki Wakabayashi
2009-08-24  1:44 ` Minchan Kim
2009-08-24  1:51   ` KAMEZAWA Hiroyuki
2009-08-24  2:23     ` Minchan Kim
2009-08-24  4:13       ` KOSAKI Motohiro
2009-08-25  4:46         ` Hiroaki Wakabayashi
2009-08-25  5:39           ` Minchan Kim
2009-08-25  9:03           ` Hugh Dickins
2009-08-25  9:37             ` KAMEZAWA Hiroyuki
2009-08-25 11:46               ` Minchan Kim
2009-08-26  2:32             ` KOSAKI Motohiro
2009-08-26 11:11             ` Hiroaki Wakabayashi