linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@zip.com.au>
To: Daniel Phillips <phillips@arcor.de>
Cc: Christian Ehrhardt <ehrhardt@mathematik.uni-ulm.de>,
	lkml <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: MM patches against 2.5.31
Date: Mon, 26 Aug 2002 12:24:50 -0700	[thread overview]
Message-ID: <3D6A8082.3775C5AB@zip.com.au> (raw)
In-Reply-To: <E17jO6g-0002XU-00@starship>

Daniel Phillips wrote:
> 
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not reference by anyone are of no
> > use and we want to free them immediatly. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
> 
> I don't understand this argument.  Suppose lru list membership is worth a
> page count of one.  Then anyone who finds a page by way of the lru list can
> safely put_page_testzero and remove the page from the lru list.  Anyone who
> finds a page by way of a page table can likewise put_page_testzero and clear
> the pte, or remove the mapping and pass the page to Andrew's pagevec
> machinery, which will eventually do the put_page_testzero.  Anyone who
> removes a page from a radix tree will also do a put_page_testzero.  Exactly
> one of those paths will result in the page count reaching zero, which tells
> us nobody else holds a reference and it's time for __free_pages_ok.  The page
> is thus freed immediately as soon as there are no more references to it, and
> does not hang around on the lru list.
> 
> Nobody has to lock against the page count.  Each put_page_testzero caller
> only locks the data structure from which it's removing the reference.
> 
> This seems so simple, what is the flaw?

The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the zero-recount
page.

CPUA:
	if (put_page_testzero()) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(page->lru);

CPUB:

	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);	/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */


2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.

It's that `put' outside the lock which is the culprit.  Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).

The sane, sensible and sucky way is to always take the lock:

page_cache_release(page)
{
	spin_lock(lru_lock);
	if (put_page_testzero(page)) {
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}
	spin_unlock(lru_lock);
}

Because this provides exclusion from another CPU discovering the page
via the LRU.

So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()?  Maybe:

page_cache_release(page)
{
	if (page_count(page) == 1) {
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			__free_pages_ok(page);
		}
		spin_unlock(lru_lock);
	} else {
		atomic_dec(&page->count);
	}
}

This is nice and quick, but racy.  Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU.  These are
rare, and can be mopped up in page reclaim.

The above code will also work for pages which aren't on the LRU.  It will
take the lock unnecessarily for (say) slab pages.  But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.


[1] The race requires that the CPU running page_cache_release find a
    five instruction window against the CPU running shrink_cache.  And
    that they be operating against the same page.  And that the CPU
    running __page_cache_release() then take an interrupt in a 3-4
    instruction window.  And that the interrupt take longer than the
    runtime for shrink_list.  And that the page be the first page in
    the pagevec.

    It's a heat-death-of-the-universe-race, but even if it were to be
    ignored, the current code is too complex.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

  reply	other threads:[~2002-08-26 19:24 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-08-22  2:29 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
2002-08-26  1:52   ` Andrew Morton
2002-08-26  9:10     ` Christian Ehrhardt
2002-08-26 14:22       ` Daniel Phillips
2002-08-26 15:29         ` Christian Ehrhardt
2002-08-26 17:56           ` Daniel Phillips
2002-08-26 19:24             ` Andrew Morton [this message]
2002-08-26 19:34               ` Daniel Phillips
2002-08-26 19:48               ` Christian Ehrhardt
2002-08-27  9:22               ` Christian Ehrhardt
2002-08-27 19:19                 ` Andrew Morton
2002-08-26 20:00             ` Christian Ehrhardt
2002-08-26 20:09               ` Daniel Phillips
2002-08-26 20:58                 ` Christian Ehrhardt
2002-08-27 16:48                   ` Daniel Phillips
2002-08-28 13:14                     ` Christian Ehrhardt
2002-08-28 17:18                       ` Daniel Phillips
2002-08-28 17:42                         ` Andrew Morton
2002-08-28 20:41                       ` Daniel Phillips
2002-08-28 21:03                         ` Andrew Morton
2002-08-28 22:04                           ` Daniel Phillips
2002-08-28 22:39                             ` Andrew Morton
2002-08-28 22:57                               ` Daniel Phillips
2002-08-26 21:31                 ` Andrew Morton
2002-08-27  3:42                   ` Benjamin LaHaise
2002-08-27  4:37                     ` Andrew Morton
2002-08-22 15:59 ` Steven Cole
2002-08-22 16:06   ` Martin J. Bligh
2002-08-22 19:45     ` Steven Cole
2002-08-26  2:15     ` Andrew Morton
2002-08-26  2:08       ` Martin J. Bligh
2002-08-26  2:32         ` Andrew Morton
2002-08-26  3:06           ` Steven Cole
2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
2002-08-27  0:13   ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3D6A8082.3775C5AB@zip.com.au \
    --to=akpm@zip.com.au \
    --cc=ehrhardt@mathematik.uni-ulm.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=phillips@arcor.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox