Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

From: Andi Kleen <andi@firstfloor.org>
To: Chris Frost <frost@cs.ucla.edu>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Benny Halevy <bhalevy@panasas.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Steve VanDeBogart <vandebo-lkml@nerdbox.net>
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)
Date: Thu, 21 Jan 2010 02:11:59 +0100
Message-ID: <87k4vc2rds.fsf@basil.nowhere.org>
In-Reply-To: <20100120215712.GO27212@frostnet.net> (Chris Frost's message of "Wed, 20 Jan 2010 13:57:12 -0800")

Chris Frost <frost@cs.ucla.edu> writes:

> For a microbenchmark that sequentially queries whether the pages of a large
> file are in memory, fincore is 7-11x faster than mmap+mincore+munmap
> when querying one page at a time (Pentium 4 running a 32 bit SMP kernel).
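
For reference, the mmap+mincore+munmap baseline for a single page can be
sketched roughly as follows. This is a user-space illustration only, not
code from the patch or the benchmark; the helper name page_is_resident()
is made up here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for mincore() on glibc */
#endif
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return 1 if the page of fd that contains byte offset `off` is resident
 * in the page cache, 0 if it is not, and -1 on error.
 * page_is_resident() is a hypothetical helper name.
 */
static int page_is_resident(int fd, off_t off)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char vec;
	void *addr;
	int ret;

	if (page_size <= 0)
		return -1;
	off -= off % page_size;		/* mmap() needs a page-aligned offset */

	/* Three system calls per queried page: mmap, mincore, munmap. */
	addr = mmap(NULL, (size_t)page_size, PROT_READ, MAP_SHARED, fd, off);
	if (addr == MAP_FAILED)
		return -1;

	/* mincore() fills one byte per page; the low bit means "resident". */
	ret = mincore(addr, (size_t)page_size, &vec);
	munmap(addr, (size_t)page_size);
	if (ret != 0)
		return -1;
	return vec & 1;
}
```

Each query pays for setting up and tearing down a mapping, which is the
overhead a file-descriptor based call would avoid.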

I haven't read your paper, but naively it is not clear to me why
the application can't simply prefetch everything and let the kernel
worry about whether it's already in memory or not?
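
A minimal sketch of "prefetch and let the kernel sort it out", assuming
posix_fadvise(2) with POSIX_FADV_WILLNEED is an acceptable way to issue
the prefetch (readahead(2) would be another). The wrapper name
prefetch_range() is made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_fadvise() */
#endif
#include <fcntl.h>

/*
 * Ask the kernel to start reading [off, off + len) of fd into the page
 * cache; len == 0 means "through the end of the file".  Pages that are
 * already cached cost no I/O.  Returns 0 on success or an error number
 * (posix_fadvise() returns the error directly instead of setting errno).
 * prefetch_range() is a hypothetical helper name.
 */
static int prefetch_range(int fd, off_t off, off_t len)
{
	return posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}
```

The call is only a hint and reports nothing about what was already
resident, so it does not replace a residency query for applications
that need that information.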

Also I'm always wondering why people do these optimizations
only now when spinning storage is about to become obsolete @)
It seems a bit like the last steam engine train.

> In this patch find_get_page() is called for each page, which in turn
> calls rcu_read_lock(), for each page. We have found that amortizing

rcu_read_lock is normally a no-op (or rather just a compiler barrier).
Even on preemptive kernels it's quite cheap and always local. It doesn't
make much sense to optimize around it.

Also it's customary to supply a man page with new system calls.
Such independent documentation often flushes out a lot of semantic issues.

> +SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
> +		unsigned char __user *, vec)

I doubt the loff_t arguments actually work for 32-bit processes on 64-bit
kernels. That typically needs a special compat stub that reassembles the
64-bit values from the two registers.

Also on 32-bit you'll end up with a 6-argument call, which can be problematic.
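
To illustrate the reassembly step: a 32-bit caller passes each 64-bit
offset as two 32-bit register values, so fd, start (2), len (2) and vec
add up to six arguments. A stand-alone sketch of the merge, with a
made-up helper name and an assumed low/high argument order (which half
lands in which register depends on the architecture's calling
convention):

```c
#include <stdint.h>

/*
 * Merge the two 32-bit halves of an offset, as passed by a 32-bit
 * caller, into one signed 64-bit value.
 * compat_merge_loff() is a hypothetical helper, not a kernel function.
 */
static int64_t compat_merge_loff(uint32_t lo, uint32_t hi)
{
	/* Widen before shifting so the high half is not truncated. */
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

A compat entry point would apply this to both start and len before
calling the common implementation.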

> +	/*
> +	 * Allocate buffer vector page.
> +	 * Optimize allocation for small values of npages because the
> +	 * __get_free_page() call doubles fincore(2) runtime when npages == 1.
> +	 */

I suspect you could afford slightly more than 64 bytes on the stack.

> +	if (npages <= sizeof(tmp_small)) {
> +		tmp = tmp_small;
> +		tmp_count = sizeof(tmp_small);
> +	} else {
> +		tmp = (void *) __get_free_page(GFP_USER);
> +		if (!tmp) {
> +			retval = -EAGAIN;
> +			goto done;
> +		}
> +		tmp_count = PAGE_SIZE;

tmp_* are impressively bad variable names.

> +	}
> +
> +	while (pgoff < pgend) {
> +		/*
> +		 * Do at most tmp_count entries per iteration, due to
> +		 * the temporary buffer size.
> +		 */
> +		for (i = 0; pgoff < pgend && i < tmp_count; pgoff++, i++)
> +			tmp[i] = fincore_page(filp->f_mapping, pgoff);

If you really care about speed you could probably do it much faster
with a radix tree gang lookup over a larger range. And of course
the get/put is not really needed, although avoiding it might
add too many special cases.

This loop needs a need_resched() check somewhere, otherwise
someone could cause very large latencies in the kernel.

But even if you added that:

e.g. if I create a 1TB file and run it over the full range,
will I get a process that cannot be Ctrl-C'ed for a long time?

Perhaps some signal handling is needed too?

It would also still be undebuggable during that time. It might
be best to simply enforce some reasonable upper limit on the range.
Most system calls do that in some form.
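
The shape of such a loop, as a user-space analogue only: the range is
clamped to an upper limit, work is done in chunks, and a stop flag is
tested between chunks. In kernel code the test point is where
cond_resched() and a signal_pending(current) check would go. All names
and both limits below are assumptions chosen for illustration.

```c
#include <errno.h>
#include <signal.h>

#define CHUNK_PAGES 64		/* assumed batch size per iteration */
#define MAX_PAGES   1024UL	/* assumed upper limit per call */

/* Would be set from a signal handler in a real program. */
static volatile sig_atomic_t stop_requested;

/* Stand-in for the per-page lookup: pretends odd pages are resident. */
static unsigned char fake_resident(unsigned long pgoff)
{
	return pgoff & 1;
}

/*
 * Fill vec[] for up to MAX_PAGES pages starting at `first`.  Returns the
 * number of pages handled (possibly fewer than requested, so the caller
 * can resume), or -EINTR if stopped before any page was handled.
 * vec must have room for min(npages, MAX_PAGES) entries.
 */
static long scan_pages(unsigned long first, unsigned long npages,
		       unsigned char (*query)(unsigned long pgoff),
		       unsigned char *vec)
{
	unsigned long done = 0;

	if (npages > MAX_PAGES)
		npages = MAX_PAGES;	/* bound the work done in one call */

	while (done < npages) {
		unsigned long n = npages - done;
		unsigned long i;

		/* Between chunks: the place to yield and notice signals. */
		if (stop_requested)
			return done ? (long)done : -EINTR;

		if (n > CHUNK_PAGES)
			n = CHUNK_PAGES;
		for (i = 0; i < n; i++)
			vec[done + i] = query(first + done + i);
		done += n;
	}
	return (long)done;
}
```

Returning a short count keeps the call restartable: the caller advances
by the number of pages reported and calls again for the remainder.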

> +
> +		if (copy_to_user(vec, tmp, i)) {

Since you called access_ok() earlier you could use __copy_to_user() here,
but as that saves only a few instructions I would rather drop
the unnecessary access_ok() earlier.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


