Re: [patch 2/2] mm: lockless pagecache

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Andy Whitcroft <apw@shadowen.org>
Cc: Andrew Morton <akpm@osdl.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [patch 2/2] mm: lockless pagecache
Date: Tue, 1 Aug 2006 11:04:28 +0200	[thread overview]
Message-ID: <20060801090428.GB17452@wotan.suse.de> (raw)
In-Reply-To: <44CE2365.6040605@shadowen.org>

On Mon, Jul 31, 2006 at 04:36:05PM +0100, Andy Whitcroft wrote:
> Nick Piggin wrote:
> >Combine page_cache_get_speculative with lockless radix tree lookups to
> >introduce lockless page cache lookups (ie. no mapping->tree_lock on
> >the read-side).
> >
> >The only atomicity changes this introduces is that the gang pagecache
> >lookup functions now behave as if they are implemented with multiple
> >find_get_page calls, rather than operating on a snapshot of the pages.
> >In practice, this atomicity guarantee is not used anyway, and it is
> >difficult to see how it could be. Gang pagecache lookups are designed
> >to replace individual lookups, so these semantics are natural.
> >
> >Swapcache can no longer use find_get_page, because it has a different
> >method of encoding swapcache position into the page. Introduce a new
> >find_get_swap_page for it.
> >
> >Signed-off-by: Nick Piggin <npiggin@suse.de>
> >
> > include/linux/swap.h |    1
> > mm/filemap.c         |  161 
> > +++++++++++++++++++++++++++++++++++++--------------
> > mm/page-writeback.c  |    8 --
> > mm/readahead.c       |    7 --
> > mm/swap_state.c      |   27 +++++++-
> > mm/swapfile.c        |    2
> > 6 files changed, 150 insertions(+), 56 deletions(-)
> >
> 
> It seems in these routines you have two different placements for the rcu 
> locking.  Either outside or inside the repeat.  Should we assume that 
> those where the locks are outside the repeat: loop have very light payloads?

Yeah, I think the "inside" rcu locking is just used where we have to
sleep. Indeed the loop should be very light I expect it will almost
never retry, and only rarely spin on NoNewRefs.

> >@@ -613,11 +613,22 @@ struct page *find_trylock_page(struct ad
> > {
> > 	struct page *page;
> > 
> >-	read_lock_irq(&mapping->tree_lock);
> >+	rcu_read_lock();
> >+repeat:
> > 	page = radix_tree_lookup(&mapping->page_tree, offset);
> >-	if (page && TestSetPageLocked(page))
> >-		page = NULL;
> >-	read_unlock_irq(&mapping->tree_lock);
> >+	if (page) {
> >+		page = page_cache_get_speculative(page);
> >+		if (unlikely(!page))
> >+			goto repeat;
> >+		/* Has the page been truncated? */
> >+		if (unlikely(page->mapping != mapping
> >+				|| page->index != offset)) {
> >+			page_cache_release(page);
> >+			goto repeat;
> >+		}
> >+	}
> >+	rcu_read_unlock();
> >+
> > 	return page;
> > }
> 
> This one has me puzzled.  This seem to no longer lock the page at all 
> when returning it.  It seems the semantics of this has changed wildly. 
> Also find_lock_page below still seems to lock the page, the semantic 
> seems maintained there?  I think I am expecting to find a 
> TestSetPageLocked() in the new version too?

Yeah, looks like I stuffed up merging it. Didn't notice because nobody
uses find_trylock_page (as Hugh points out, this should be the find_get_page
body, and find_trylock should be unchanged).

> >+	rcu_read_lock();
> >+repeat:
> >+	nr_found = radix_tree_gang_lookup(&mapping->page_tree,
> > 				(void **)pages, start, nr_pages);
> >-	for (i = 0; i < ret; i++)
> >-		page_cache_get(pages[i]);
> >-	read_unlock_irq(&mapping->tree_lock);
> >-	return ret;
> >+	for (i = 0; i < nr_found; i++) {
> >+		struct page *page;
> >+		page = page_cache_get_speculative(pages[i]);
> >+		if (unlikely(!page)) {
> >+bail:
> >+			/*
> >+			 * must return at least 1 page, so caller continues
> >+			 * calling in.
> 
> Although that is a resonable semantic, several callers seem to expect 
> all or nothing semantics here.  Mostly the direct callers to 
> find_get_pages().  The callers using pagevec_lookup() at least seem to 
> cope with a partial left fill as implemented here.

I don't see any problems with find_get_pages callers (splice and ramfs).
(hmm, ramfs should be moved over to find_get_pages_contig). You see, if
the pages being looked up are getting chucked out of pagecache anyway,
then then it could be valid to return.

Actually: find_get_pages is a little iffy, but ramfs is happy with that.
Maybe I'll reintroduce my find_get_pages_nonatomic, and leave fgps under
lock.

pagevec_lookup users seem to be fine. Maybe an API rename is in order
for them too, though.

> 
> >+			 */
> >+			if (i == 0)
> >+				goto repeat;
> >+			break;
> >+		}
> >+
> >+		/* Has the page been truncated? */
> >+		if (unlikely(page->mapping != mapping
> >+				|| page->index < start)) {
> >+			page_cache_release(page);
> >+			goto bail;
> 
> I have looked at this check for a while now and I can say I am troubled 
> by it.  We do not know which page we are looking up so can we truly say 
> the index check here is sufficient?  Also, could not the start= below 
> lead us to follow a moving page and skip pages?  Perhaps there is no way 

Hmm, it may move upwards and lead us to skip I think. Good point, I'll
look at that.

> to get any sort of guarentee with this interface before or after this 
> change; and all is well?  Tell me it is :).

Well the gang lookups were generally introduced to replace multiple calls
to fgp, so callers don't use the requirement of an atomic snapshot. However
you can still use the tree_lock to get the old semantics.

> 
> >+		}
> >+
> >+		/* ensure we don't pick up pages that have moved behind us */
> >+		start = page->index+1;
> >+	}
> >+	rcu_read_unlock();
> >+	return i;
> > }
> > 
> > /**
> >@@ -752,19 +786,35 @@ unsigned find_get_pages_contig(struct ad
> > 			       unsigned int nr_pages, struct page **pages)
> > {
> > 	unsigned int i;
> >-	unsigned int ret;
> >+	unsigned int nr_found;
> > 
> >-	read_lock_irq(&mapping->tree_lock);
> >-	ret = radix_tree_gang_lookup(&mapping->page_tree,
> >+	rcu_read_lock();
> >+repeat:
> >+	nr_found = radix_tree_gang_lookup(&mapping->page_tree,
> > 				(void **)pages, index, nr_pages);
> >-	for (i = 0; i < ret; i++) {
> >-		if (pages[i]->mapping == NULL || pages[i]->index != index)
> >+	for (i = 0; i < nr_found; i++) {
> >+		struct page *page;
> >+		page = page_cache_get_speculative(pages[i]);
> >+		if (unlikely(!page)) {
> >+bail:
> >+			/*
> >+			 * must return at least 1 page, so caller continues
> >+			 * calling in.
> >+			 */
> >+			if (i == 0)
> >+				goto repeat;
> > 			break;
> >+		}
> > 
> >-		page_cache_get(pages[i]);
> >+		/* Has the page been truncated? */
> >+		if (unlikely(page->mapping != mapping
> >+				|| page->index != index)) {
> >+			page_cache_release(page);
> >+			goto bail;
> >+		}
> 
> Ok, normally this construct is checking against the page at 
> (mapping,index) so it is very unlikely that the index does not match. 
> However in this case we are doing a contiguity scan, so in fact the 
> likelyhood of this missmatching is more defined by the likelyhood of 
> contiguity in the mapping.  The check originally had no such hints?  Is 
> it appropriate to have a hint here?

Yes. We have nothing protecting the page against being truncated or
moved by splice (wheras previously tree_lock would do it).

Thanks again for the review. I'll try to fix up the outstanding
issues you identified.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2006-08-01  9:04 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-07-26  6:39 Nick Piggin
2006-07-31 15:36 ` Andy Whitcroft
2006-07-31 18:34   ` Hugh Dickins
2006-08-01  9:04   ` Nick Piggin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060801090428.GB17452@wotan.suse.de \
    --to=npiggin@suse.de \
    --cc=akpm@osdl.org \
    --cc=apw@shadowen.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox