Re: [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic()

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Arnd Bergmann <arnd@arndb.de>
To: afzal mohammed <afzal.mohd.ma@gmail.com>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>,
	Linus Walleij <linus.walleij@linaro.org>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	 Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Nicolas Pitre <nico@fluxnic.net>,
	 Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>
Subject: Re: [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic()
Date: Fri, 12 Jun 2020 14:02:13 +0200	[thread overview]
Message-ID: <CAK8P3a1XUJHC0kG_Qwh4D4AoxTgCL5ggHd=45yNSmzaYWLUWXw@mail.gmail.com> (raw)
In-Reply-To: <9e1de19f35e2d5e1d115c9ec3b7c3284b4a4e077.1591885760.git.afzal.mohd.ma@gmail.com>

On Fri, Jun 12, 2020 at 12:18 PM afzal mohammed <afzal.mohd.ma@gmail.com> wrote:
>
> copy_{from,to}_user() uaccess helpers are implemented by user page
> pinning, followed by temporary kernel mapping & then memcpy(). This
> helps to achieve user page copy when current virtual address mapping
> of the CPU excludes user pages.
>
> Performance wise, results are not encouraging, 'dd' on tmpfs results,
>
> ARM Cortex-A8, BeagleBone White (256MiB RAM):
> w/o series - ~29.5 MB/s
> w/ series - ~20.5 MB/s
> w/ series & highmem disabled - ~21.2 MB/s
>
> On Cortex-A15(2GiB RAM) in QEMU:
> w/o series - ~4 MB/s
> w/ series - ~2.6 MB/s
>
> Roughly a one-third drop in performance. Disabling highmem improves
> performance only slightly.
>
> 'hackbench' also showed a similar pattern.
>
> uaccess routines using page pinning & temporary kernel mapping is not
> something new, it has been done long long ago by Ingo [1] as part of
> 4G/4G user/kernel mapping implementation on x86, though not merged in
> mainline.
>
> [1] https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/

Nice changelog text! I agree the performance drop is not great.
There are probably some things that can be done to optimize it,
but I guess most of the overhead is from the page table operations
and cannot be avoided.

What was the exact 'dd' command you used, in particular the block size?
Note that by default, 'dd' will request 512 bytes at a time, so you usually
only access a single page. It would be interesting to see the overhead with
other typical or extreme block sizes, e.g. '1', '64', '4K', '64K' or '1M'.

If you want to drill down into where exactly the overhead is (i.e.
get_user_pages or kmap_atomic, or something different), using
'perf record dd ..', and 'perf report' may be helpful.

> +static int copy_chunk_from_user(unsigned long from, int len, void *arg)
> +{
> +       unsigned long *to_ptr = arg, to = *to_ptr;
> +
> +       memcpy((void *) to, (void *) from, len);
> +       *to_ptr += len;
> +       return 0;
> +}
> +
> +static int copy_chunk_to_user(unsigned long to, int len, void *arg)
> +{
> +       unsigned long *from_ptr = arg, from = *from_ptr;
> +
> +       memcpy((void *) to, (void *) from, len);
> +       *from_ptr += len;
> +       return 0;
> +}

Will gcc optimize away the indirect function call and inline everything?
If not, that would be a small part of the overhead.

> +unsigned long gup_kmap_copy_from_user(void *to, const void __user *from, unsigned long n)
> +{
> +       struct page **pages;
> +       int num_pages, ret, i;
> +
> +       if (uaccess_kernel()) {
> +               memcpy(to, (__force void *)from, n);
> +               return 0;
> +       }
> +
> +       num_pages = DIV_ROUND_UP((unsigned long)from + n, PAGE_SIZE) -
> +                                (unsigned long)from / PAGE_SIZE;

Make sure this doesn't turn into actual division operations but uses shifts.
It might even be clearer here to open-code the shift operation so readers
can see what this is meant to compile into.

> +       pages = kmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
> +       if (!pages)
> +               goto end;

Another micro-optimization may be to avoid the kmalloc for the common case,
e.g. anything with "num_pages <= 64", using an array on the stack.

> +       ret = get_user_pages_fast((unsigned long)from, num_pages, 0, pages);
> +       if (ret < 0)
> +               goto free_pages;
> +
> +       if (ret != num_pages) {
> +               num_pages = ret;
> +               goto put_pages;
> +       }

I think this is technically incorrect: if get_user_pages_fast() only
gets some of the
pages, you should continue with the short buffer and return the number
of remaining
bytes rather than not copying anything. I think you did that correctly
for a failed
kmap_atomic(), but this has to use the same logic.

     Arnd

next prev parent reply	other threads:[~2020-06-12 12:02 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-12 10:17 [RFC 0/3] ARM: copy_{from,to}_user() for vmsplit 4g/4g afzal mohammed
2020-06-12 10:17 ` [RFC 1/3] lib: copy_{from,to}_user using gup & kmap_atomic() afzal mohammed
2020-06-12 12:02   ` Arnd Bergmann [this message]
2020-06-12 13:55     ` afzal mohammed
2020-06-12 20:07       ` Arnd Bergmann
2020-06-13 12:04         ` afzal mohammed
2020-06-13 12:51           ` Al Viro
2020-06-13 12:56             ` Al Viro
2020-06-13 13:42               ` afzal mohammed
2020-06-13 15:31                 ` Al Viro
2020-06-13 15:41                   ` Al Viro
2020-06-13 16:00                     ` Al Viro
2020-06-13 18:55                       ` Arnd Bergmann
2020-06-13 13:15           ` Russell King - ARM Linux admin
2020-06-14 13:06             ` afzal mohammed
2020-06-13 20:45           ` Arnd Bergmann
2020-06-13 22:16             ` Matthew Wilcox
2020-06-14 13:21             ` afzal mohammed
2020-06-14 14:55               ` afzal mohammed
2020-06-13 11:08   ` Andy Shevchenko
2020-06-13 13:29     ` afzal mohammed
2020-06-12 10:18 ` [RFC 2/3] ARM: uaccess: let UACCESS_GUP_KMAP_MEMCPY enabling afzal mohammed
2020-06-12 10:18 ` [RFC 3/3] ARM: provide CONFIG_VMSPLIT_4G_DEV for development afzal mohammed
2020-06-12 15:19 ` [RFC 0/3] ARM: copy_{from,to}_user() for vmsplit 4g/4g Nicolas Pitre
2020-06-12 16:01   ` afzal mohammed
2020-06-12 16:03     ` afzal mohammed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAK8P3a1XUJHC0kG_Qwh4D4AoxTgCL5ggHd=45yNSmzaYWLUWXw@mail.gmail.com' \
    --to=arnd@arndb.de \
    --cc=afzal.mohd.ma@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=linus.walleij@linaro.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@armlinux.org.uk \
    --cc=nico@fluxnic.net \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox