From: Mike Kravetz <mike.kravetz@oracle.com>
To: Andrea Arcangeli <aarcange@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Pavel Emelyanov <xemul@parallels.com>,
	zhang.zhanghailiang@huawei.com,
	Dave Hansen <dave.hansen@intel.com>,
	Rik van Riel <riel@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Bamvor Zhang Jian <bamvor.zhangjian@linaro.org>,
	Bharata B Rao <bharata@linux.vnet.ibm.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>
Subject: Re: [PATCH 00/12] userfaultfd non-x86 and selftest updates for 4.2.0+
Date: Wed, 30 Sep 2015 17:42:09 -0700
Message-ID: <560C8161.5020602@oracle.com>
In-Reply-To: <20151001000625.GF19466@redhat.com>

On 09/30/2015 05:06 PM, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Wed, Sep 30, 2015 at 02:56:19PM -0700, Mike Kravetz wrote:
>> On 09/08/2015 01:43 PM, Andrea Arcangeli wrote:
>>> Here are some pending updates for userfaultfd mostly to the self test,
>>> the rest are cleanups.
>>
>> I have a potential use case for userfaultfd.  So, I started experimenting
> 
> Glad to hear you may have one more use case.
> 
> On a side note, there's also a patch posted to CRIU to pagein lazily
> anonymous memory during restore using userfaultfd, that's yet another
> recent user.
> 
>> with the self test code.  I replaced the posix_memalign() calls to allocate
>> area_src and area_dst with mmap().  mmap(MAP_PRIVATE | MAP_ANONYMOUS) works
>> as expected.  However, mmap(MAP_SHARED | MAP_ANONYMOUS) causes the test to
>> fail without any errors from the userfaultfd APIs.
>>
>> --------------------
>> running userfaultfd
>> --------------------
>> nr_pages: 32768, nr_pages_per_cpu: 8192
>> bounces: 31, mode: rnd racing ver poll, page_nr 31523 wrong count 0 1
>>
>> I would expect some type of error from the ioctl() that registers the
>> range, or perhaps the poll/copy code?  Just curious about the expected
>> behavior.
> 
> That should return an error during UFFDIO_REGISTER and the testcase
> shouldn't start, not sure what went wrong. Can you send the
> modification to the testcase?
> 
> UFFDIO_REGISTER is the point where userfaultfd is first told which
> kind of memory you want to manage with userfaults. It was planned to
> fail there (and it cannot fail any earlier).
> 
> This check has to fail and return -EINVAL in the ioctl(UFFDIO_REGISTER).
> 
> 		/* check not compatible vmas */
> 		ret = -EINVAL;
> 		if (cur->vm_ops)
> 			goto out_unlock;
> 
> In the testcase you should get an exit 1 and the fprintf printed:
> 
> 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
> 			fprintf(stderr, "register failure\n");
> 			return 1;
> 		}
> 
> Could you double check these two paths to find what's wrong?

My apologies!!!!

This was running in my hacked up kernel.  I removed the cur->vm_ops check
as a quick and dirty way for me to register hugetlb vmas.  Sorry, I forgot
I was still running this kernel.

>> FYI - My use case is for hugetlbfs.  I would like a mechanism to catch all
>> new huge page allocations as a result of page faults.  I have some very
>> rough code to extend userfaultfd and add the required functionality to
>> hugetlbfs.  Still working on it.
> 
> Adding support for hugetlbfs sounds great to me.

The use case I have is pretty simple.  Recently, fallocate hole punch
support was added to hugetlbfs.  The reason for this is that the database
people want to 'free up' huge pages they know will no longer be used.
However, these huge pages are part of SGA areas sometimes mapped by tens
of thousands of tasks.  They would like to 'catch' any tasks that
(incorrectly) fault in a page after hole punch.  The thought is that
this can be done with userfaultfd by registering these mappings with
UFFDIO_REGISTER_MODE_MISSING.  No need for UFFDIO_COPY or UFFDIO_ZEROPAGE.
We would just send a signal to the task (such as SIGBUS) and then do
a UFFDIO_WAKE.  The only downside to this approach is having thousands
of threads monitoring userfault fds to catch a database error condition.
I believe the MADV_USERFAULT/NOUSERFAULT code you proposed some time back
would be the ideal solution for this use case.  Unfortunately, I did not
know of this use case or your proposal back then. :(

-- 
Mike Kravetz

> 
> Only anonymous memory has null vm_ops, so once you extend the code to
> track hugetlbfs (tracking at least tmpfs and not just anonymous memory
> is needed for volatile pages which also work on tmpfs) you should
> relax the above check to accept &hugetlb_vm_ops.
> 
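
A minimal userspace sketch of what relaxing the check (rather than deleting
it) might look like. The struct definitions and the two ops tables below are
stand-ins invented for illustration, not the kernel's own definitions; only
the shape of the test follows the snippet quoted earlier in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only the vm_ops pointer matters here. */
struct vm_operations_struct { int unused; };
struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
};

/* Hypothetical stand-ins for the kernel's ops tables. */
static const struct vm_operations_struct hugetlb_vm_ops = { 0 };
static const struct vm_operations_struct some_file_vm_ops = { 0 };

/*
 * Return 0 if the vma may be registered, -EINVAL otherwise.
 * Anonymous memory has NULL vm_ops; hugetlbfs is the one extra case
 * accepted here, so every other mapping type is still rejected.
 */
static int uffd_vma_compatible(const struct vm_area_struct *cur)
{
	if (cur->vm_ops && cur->vm_ops != &hugetlb_vm_ops)
		return -EINVAL;
	return 0;
}
```

With a test of this shape, a vma backed by any other ops table would still
fail registration with -EINVAL instead of silently misbehaving later.
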
> You then need to report which ioctls the current kernel supports for
> the kind of memory that was registered, in the uffdio_register->ioctls
> field.
> 
> 		/*
> 		 * Now that we scanned all vmas we can already tell
> 		 * userland which ioctls methods are guaranteed to
> 		 * succeed on this range.
> 		 */
> 		if (put_user(UFFD_API_RANGE_IOCTLS,
> 			     &user_uffdio_register->ioctls))
> 			ret = -EFAULT;
> 
> #define UFFD_API_RANGE_IOCTLS			\
> 	((__u64)1 << _UFFDIO_WAKE |		\
> 	 (__u64)1 << _UFFDIO_COPY |		\
> 	 (__u64)1 << _UFFDIO_ZEROPAGE)
> 
> hugetlbfs doesn't seem to support the zeropage. So if vma->vm_ops ==
> &hugetlb_vm_ops, it should return only WAKE|COPY in
> uffdio_register->ioctls.
> 
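
From the application side, the returned mask could be checked roughly as
below. This is only a sketch: the bit numbers mirror _UFFDIO_WAKE,
_UFFDIO_COPY and _UFFDIO_ZEROPAGE from <linux/userfaultfd.h> but are
redefined under local names so the example builds without kernel headers,
and RANGE_IOCTLS_HUGETLB is a hypothetical name for the reduced mask
suggested above, not an existing kernel macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the ioctl bit numbers (assumed to match the uapi header). */
enum {
	BIT_UFFDIO_WAKE     = 0x02,
	BIT_UFFDIO_COPY     = 0x03,
	BIT_UFFDIO_ZEROPAGE = 0x04,
};

/* Mask reported for anonymous memory, as in UFFD_API_RANGE_IOCTLS. */
#define RANGE_IOCTLS_ANON			\
	((uint64_t)1 << BIT_UFFDIO_WAKE |	\
	 (uint64_t)1 << BIT_UFFDIO_COPY |	\
	 (uint64_t)1 << BIT_UFFDIO_ZEROPAGE)

/* Hypothetical reduced mask for hugetlbfs: WAKE and COPY only. */
#define RANGE_IOCTLS_HUGETLB			\
	((uint64_t)1 << BIT_UFFDIO_WAKE |	\
	 (uint64_t)1 << BIT_UFFDIO_COPY)

/* True if the mask returned by UFFDIO_REGISTER advertises the given ioctl. */
static bool range_supports(uint64_t ioctls, unsigned int bit)
{
	return (ioctls >> bit) & 1;
}
```

An application would call range_supports() on uffdio_register.ioctls after a
successful UFFDIO_REGISTER and avoid UFFDIO_ZEROPAGE when the bit is clear.
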
> hugetlbfs is non standard, there's no sysconf(_SC_PAGE_SIZE) to know
> the minimum granularity supported by the UFFDIO_COPY|WAKE of
> hugetlbfs. This is a generic issue with hugetlbfs, not really related
> to userfaultfd. The same constraints of hugetlbfs minimum granularity
> and alignment apply to all other memory management syscalls too.
> 
> So the app itself using hugetlbfs will have to know by other means
> (i.e. sysfs mangling) that the minimum granularity supported by
> UFFDIO_COPY is 2MB (or 1GB). That is again because it registered
> userfaultfd on hugetlbfs, and hugetlbfs has non standard
> constraints. In turn UFFDIO_COPY of hugetlbfs has to fail if len is
> not a multiple of 2MB (never the case for all other kinds of memory
> that userfaultfd could ever manage).
> 
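
The granularity rule above amounts to an alignment test on the destination
address and length. A sketch of such a test follows; the function name and
the choice to also reject a zero length are assumptions made for this
example, and the huge page size is passed in by the caller because, as noted
above, it has to be discovered by other means.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if dst and len are both multiples of hpage_size and len is
 * non-zero, -EINVAL otherwise. hpage_size must be a power of two
 * (2MB or 1GB in the cases discussed here).
 */
static int hugetlb_copy_args_ok(uint64_t dst, uint64_t len, uint64_t hpage_size)
{
	/* The mask test below is only valid for a power-of-two size. */
	if (hpage_size == 0 || (hpage_size & (hpage_size - 1)))
		return -EINVAL;
	if (len == 0 || ((dst | len) & (hpage_size - 1)))
		return -EINVAL;
	return 0;
}
```

For the default huge page size, an application could take the value from the
Hugepagesize line in /proc/meminfo before calling a check like this one.
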
> There's flexibility in the userfaultfd API to gradually expand the
> coverage to a variety of types of virtual memory while at the same
> time not risking random behavior from a new app if run on an old
> kernel. The new app will be able to reliably tell the user to
> upgrade the kernel (or it can fall back to a non-userfaultfd mode with
> just a warning to the user).
> 
> We need to handle the write protection faults too as soon as possible
> (VM_UFFD_WP/UFFD_FEATURE_PAGEFAULT_FLAG_WP). The uffdio_api->features
> are already prepared to report to userland the availability of the
> UFFD_FEATURE_PAGEFAULT_FLAG_WP. Then the app can set
> UFFDIO_REGISTER_MODE_WP in uffdio_register.mode.
> 
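
The negotiation described above could be sketched as follows. The constants
are local copies assumed to match UFFD_FEATURE_PAGEFAULT_FLAG_WP,
UFFDIO_REGISTER_MODE_MISSING and UFFDIO_REGISTER_MODE_WP in
<linux/userfaultfd.h>, and choose_register_mode() is a hypothetical helper
name, not part of any API.

```c
#include <stdint.h>

/* Local copies of the uapi bits (assumed values, see lead-in). */
enum {
	FEATURE_PAGEFAULT_FLAG_WP = 1 << 0,
	MODE_MISSING              = 1 << 0,
	MODE_WP                   = 1 << 1,
};

/*
 * Pick the value for uffdio_register.mode from the feature bits the
 * kernel reported in uffdio_api.features. Write protection tracking is
 * requested only if the app wants it and the kernel advertises it;
 * otherwise the app falls back to missing-page tracking alone.
 */
static uint64_t choose_register_mode(uint64_t features, int want_wp)
{
	uint64_t mode = MODE_MISSING;

	if (want_wp && (features & FEATURE_PAGEFAULT_FLAG_WP))
		mode |= MODE_WP;
	return mode;
}
```

An app that strictly requires write protection faults could compare the
returned mode against MODE_MISSING | MODE_WP and warn the user when the
kernel is too old.
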
> I mentioned this because while there's flexibility to expand the
> coverage gradually, it'd be great if all kinds of memory supporting
> UFFDIO_REGISTER_MODE_MISSING would also support
> UFFDIO_REGISTER_MODE_WP once that becomes available, as it'd keep
> userfaultfd_register() a bit simpler to maintain.
> 
> Thanks,
> Andrea
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
