* MM patches against 2.5.31
From: Andrew Morton @ 2002-08-22 2:29 UTC
To: lkml, linux-mm
I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
The rolled up patch there is suitable for ongoing testing and
development. The individual patches are in the broken-out/
directory and should all be documented.
broken-out/linus.patch
Incremental BK patch from Linus' tree
broken-out/page_reserved.patch
Test PageReserved in pagevec_release()
broken-out/scsi_hack.patch
Fix block-highmem for scsi
broken-out/page_cache_release_lru_fix.patch
Fix a race between __page_cache_release() and shrink_cache().
broken-out/page_cache_release_fix.patch
Fix __page_cache_release() bugs
broken-out/mvm.patch
Fix vmalloc bugs
broken-out/pte-chain-fix.patch
Fix a VM lockup on uniprocessors
broken-out/func-fix.patch
gcc-2.91.66 does not support __func__
broken-out/ext3-htree.patch
Indexed directories for ext3
broken-out/rmap-mapping-BUG.patch
Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()
broken-out/misc.patch
misc fixlets
broken-out/tlb-speedup.patch
Reduce typical global TLB invalidation frequency by 35%
broken-out/buffer-slab-align.patch
Don't align the buffer_head slab on hardware cacheline boundaries
broken-out/zone-rename.patch
Rename zone_struct->zone, zonelist_struct->zonelist. Remove zone_t,
zonelist_t.
broken-out/per-zone-lru.patch
Per-zone page LRUs
broken-out/per-zone-lock.patch
Per-zone LRU list locking
broken-out/l1-max-size.patch
Infrastructure for determining the maximum L1 cache size which the kernel
may have to support.
broken-out/zone-lock-alignment.patch
Pad struct zone to ensure that the lru and buddy locks are in separate
cachelines.
broken-out/put_page_cleanup.patch
Clean up put_page() and page_cache_release().
broken-out/anon-batch-free.patch
Batched freeing and de-LRUing of anonymous pages
broken-out/writeback-sync.patch
Writeback fixes and tuneups
broken-out/ext3-inode-allocation.patch
Fix an ext3 deadlock
broken-out/ext3-o_direct.patch
O_DIRECT support for ext3.
broken-out/jfs-bio.patch
Convert JFS to use direct-to-BIO I/O
broken-out/discontig-paddr_to_pfn.patch
Convert page pointers into pfns for i386 NUMA
broken-out/discontig-setup_arch.patch
Rework setup_arch() for i386 NUMA
broken-out/discontig-mem_init.patch
Restructure mem_init for i386 NUMA
broken-out/discontig-i386-numa.patch
discontigmem support for i386 NUMA
broken-out/cleanup-mem_map-1.patch
Clean up lots of open-coded uses of mem_map[]. For ia32 NUMA
broken-out/zone-pages-reporting.patch
Fix the boot-time reporting of each zone's available pages
broken-out/enospc-recovery-fix.patch
Fix the __block_write_full_page() error path.
broken-out/fix-faults.patch
Back out the initial work for atomic copy_*_user()
broken-out/bkl-consolidate.patch
Consolidate the per-arch lock_kernel() implementations.
broken-out/might_sleep.patch
Infrastructure to detect sleep-inside-spinlock bugs
broken-out/spin-lock-check.patch
spinlock/rwlock checking infrastructure
broken-out/atomic-copy_user.patch
Support for atomic copy_*_user()
broken-out/kmap_atomic_reads.patch
Use kmap_atomic() for generic_file_read()
broken-out/kmap_atomic_writes.patch
Use kmap_atomic() for generic_file_write()
broken-out/config-PAGE_OFFSET.patch
Configurable kernel/user memory split
broken-out/throttling-fix.patch
Fix throttling of heavy write()rs.
broken-out/dirty-state-accounting.patch
Make the global dirty memory accounting more accurate
broken-out/rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-22 11:28 UTC
To: Andrew Morton; +Cc: lkml, linux-mm
On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
>
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.
Sorry, but we still have the page release race in multiple places.
Look at the following (page starts with page_count == 1):

Processor 1:  refill_inactive, lines 378-395. As page count == 1 we'll
              continue with line 401.

Processor 2:  __pagevec_release, line 138, calls release_pages.
              release_pages, lines 100-111: put_page_testzero brings
              the page count to 0 and we'll continue at line 114.
              Note that this may happen while another processor holds
              the lru lock, i.e. there is no point in checking for
              page count == 0 with the lru lock held, because the lru
              lock doesn't protect against decrements of the page
              count after the check.

Processor 1:  line 401: page_cache_get resurrects the page; the page
              count is now 1. Lines 402-448; line 448 calls
              __pagevec_release, which (line 138) calls release_pages.
              release_pages, lines 100-111: put_page_testzero brings
              the page count back to 0 (!!!), i.e. we continue at
              line 114. Lines 114-123: the page count == 0 check in
              line 123 succeeds and the page is returned to the buddy
              allocator.

Processor 2:  lines 114-123: the page count == 0 check in line 123
              succeeds, i.e. the page is returned to the buddy
              allocator a second time. ===> BOOM
Neither the lru lock nor any of the page count == 0 checks can
prevent this from happening.
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Steven Cole @ 2002-08-22 15:59 UTC
To: Andrew Morton; +Cc: lkml, linux-mm
On Wed, 2002-08-21 at 20:29, Andrew Morton wrote:
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.
The good news: I ran my dbench 1..128 stress test and for the first
time since 2.5.31-vanilla there were _no_ BUG()s reported at all.
The other news: from dmesg:
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald: page allocation failure. order:0, mode:0x0
The kjournald failure message came out with dbench 48 running on an ext3
partition. The test continued with only this one instance of this
message.
Steven
* Re: MM patches against 2.5.31
From: Martin J. Bligh @ 2002-08-22 16:06 UTC
To: Steven Cole, Andrew Morton; +Cc: lkml, linux-mm
> kjournald: page allocation failure. order:0, mode:0x0
I've seen this before, but am curious how we ever passed
a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
that does this?
Thanks,
M.
* Re: MM patches against 2.5.31
From: Steven Cole @ 2002-08-22 19:45 UTC
To: Martin J. Bligh; +Cc: Andrew Morton, lkml, linux-mm
On Thu, 2002-08-22 at 10:06, Martin J. Bligh wrote:
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?
>
> Thanks,
>
> M.
I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing
unusual happening, and then got this from pdflush with dbench 96.
pdflush: page allocation failure. order:0, mode:0x0
FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT.
Steven
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 1:52 UTC
To: Christian Ehrhardt; +Cc: lkml, linux-mm
Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> >
> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> >
> > The rolled up patch there is suitable for ongoing testing and
> > development. The individual patches are in the broken-out/
> > directory and should all be documented.
>
> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):
>
So we do. It's a hugely improbable race, so there's no huge rush
to fix it. Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range. ho-hum.
What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all. Just let it sit on the
LRU until page reclaim encounters it. With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.
And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.
That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this). And one which doesn't expect the
pages to become freeable. The latter will leave the occasional
zero-count page on the LRU, as above.
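In rough outline, something like this (a sketch only; the names and
the lru-list helpers below are placeholders for illustration, not
final code):

	/* Pages expected to become freeable: take the LRU lock up
	 * front so no other CPU can find a zero-count page via the
	 * LRU while we free it. */
	void pagevec_free_release(struct pagevec *pvec)
	{
		int i;

		spin_lock_irq(&_pagemap_lru_lock);
		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];

			if (put_page_testzero(page)) {
				if (TestClearPageLRU(page))
					del_page_from_inactive_list(page);
				__free_pages_ok(page, 0);
			}
		}
		spin_unlock_irq(&_pagemap_lru_lock);
		pagevec_reinit(pvec);
	}

	/* Pages not expected to become freeable: drop the refcounts
	 * without the lock; a zero-count page simply stays on the
	 * LRU until page reclaim encounters it. */
	void pagevec_nofree_release(struct pagevec *pvec)
	{
		int i;

		for (i = 0; i < pagevec_count(pvec); i++)
			put_page_testzero(pvec->pages[i]);
		pagevec_reinit(pvec);
	}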
Sound sane?
* Re: MM patches against 2.5.31
From: Martin J. Bligh @ 2002-08-26 2:08 UTC
To: Andrew Morton; +Cc: Steven Cole, lkml, linux-mm
>> > kjournald: page allocation failure. order:0, mode:0x0
>>
>> I've seen this before, but am curious how we ever passed
>> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
>> that does this?
>
> Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> while kjournald is executing. A radix-tree node allocation
> on the add-to-swap path perhaps. (The swapout failure messages
> aren't supposed to come out, but mempool_alloc() stomps on the
> caller's setting of PF_NOWARN.)
>
> Or:
>
> mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> 89
No, GFP_ATOMIC is not 0:
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define GFP_ATOMIC (__GFP_HIGH)
Looking at all the options:
#define __GFP_WAIT 0x10 /* Can wait and reschedule? */
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define __GFP_IO 0x40 /* Can start low memory physical IO? */
#define __GFP_HIGHIO 0x80 /* Can start high mem physical IO? */
#define __GFP_FS 0x100 /* Can call down to low-level FS? */
What worries me is that 0 seems to mean "you can't do anything
to try and free it, but you can't access the emergency pools either".
Seems doomed to failure to me. And the standard sets we have are
#define GFP_NOHIGHIO ( __GFP_WAIT | __GFP_IO)
#define GFP_NOIO ( __GFP_WAIT)
#define GFP_NOFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_USER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
So I think someone's screwed something up, and this is accidental.
Or I'm just totally misunderstanding this, which is perfectly
possible.
M.
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 2:15 UTC
To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm
"Martin J. Bligh" wrote:
>
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?
Could be anywhere, really. A network interrupt doing GFP_ATOMIC
while kjournald is executing. A radix-tree node allocation
on the add-to-swap path perhaps. (The swapout failure messages
aren't supposed to come out, but mempool_alloc() stomps on the
caller's setting of PF_NOWARN.)
Or:
mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
89
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 2:32 UTC
To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm
"Martin J. Bligh" wrote:
>
> >> > kjournald: page allocation failure. order:0, mode:0x0
> >>
> >> I've seen this before, but am curious how we ever passed
> >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> >> that does this?
> >
> > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > while kjournald is executing. A radix-tree node allocation
> > on the add-to-swap path perhaps. (The swapout failure messages
> > aren't supposed to come out, but mempool_alloc() stomps on the
> > caller's setting of PF_NOWARN.)
> >
> > Or:
> >
> > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > 89
>
> No, GFP_ATOMIC is not 0:
>
It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
__GFP_WAIT|__GFP_IO on the first attempt.
It also disables the printk, so maybe I just dunno ;) show_stack()
would tell.
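The arithmetic, using the flag values Martin quoted (a worked example
only, not code from the tree):

	#define __GFP_WAIT	0x10
	#define __GFP_IO	0x40
	#define GFP_NOIO	(__GFP_WAIT)

	/* mempool_alloc()'s first, must-not-block attempt: */
	mask = GFP_NOIO & ~(__GFP_WAIT | __GFP_IO);	/* == 0x0 */

which is exactly the mode:0x0 in those failure messages.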
* Re: MM patches against 2.5.31
From: Steven Cole @ 2002-08-26 3:06 UTC
From: Steven Cole @ 2002-08-26 3:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, lkml, linux-mm
On Sun, 2002-08-25 at 20:32, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> >
> > >> > kjournald: page allocation failure. order:0, mode:0x0
> > >>
> > >> I've seen this before, but am curious how we ever passed
> > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> > >> that does this?
> > >
> > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > > while kjournald is executing. A radix-tree node allocation
> > > on the add-to-swap path perhaps. (The swapout failure messages
> > > aren't supposed to come out, but mempool_alloc() stomps on the
> > > caller's setting of PF_NOWARN.)
> > >
> > > Or:
> > >
> > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > > 89
> >
> > No, GFP_ATOMIC is not 0:
> >
>
> It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
> __GFP_WAIT|__GFP_IO on the first attempt.
>
> It also disables the printk, so maybe I just dunno ;) show_stack()
> would tell.
>
The "kjournald: page allocation failure. order:0, mode:0x0" message and
"pdflush: page allocation failure. order:0, mode:0x0" occurred only once
each on my dual p3 scsi ext3 test box running 2.5.31-mm1. So, I added
something like this:
--- page_alloc.c.orig Thu Aug 22 17:27:32 2002
+++ page_alloc.c Thu Aug 22 17:29:24 2002
@@ -388,6 +388,8 @@
printk("%s: page allocation failure."
" order:%d, mode:0x%x\n",
current->comm, order, gfp_mask);
+ if (gfp_mask == 0)
+ BUG();
}
return NULL;
}
and continued testing on Friday with no repeats of the "page allocation failure"
messages. I obtained a second dual p3 ext3 test box (ide this time) and left both
boxes running 2.5.31-mm1 and the dbench 1..128 stress test scripted to rerun many
times over the weekend. Due to a couple of firewalls, I can't look at those boxes
from here, but I'll let you know what happened in about 10 to 11 hours.
Cheers,
Steven
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 9:10 UTC
To: Andrew Morton; +Cc: lkml, linux-mm
On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> > >
> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> > >
> > > The rolled up patch there is suitable for ongoing testing and
> > > development. The individual patches are in the broken-out/
> > > directory and should all be documented.
> >
> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):
> >
>
> So we do. It's a hugely improbable race, so there's no huge rush
> to fix it.
Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.
> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range. ho-hum.
I didn't check each kernel, but I agree that most of the recent kernels
have the potential for this race. Your tree just happened to be the
one where I found it first.
> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all. Just let it sit on the
> LRU until page reclaim encounters it. With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.
It's not just a problem with __page_cache_release, but yes, it seems
to be an SMP-only race.
> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.
>
> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this). And one which doesn't expect the
> pages to become freeable. The latter will leave the occasional
> zero-count page on the LRU, as above.
>
> Sound sane?
So the rules would be:
* if you bring the page count to zero call __free_pages_ok unless the
page is on the lru.
* if someone (page reclaim) walking the lru finds a page with page count zero,
remove it from the lru and call __free_pages_ok.
This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock. This doesn't sound like a good idea
but it seems to be possible to do it race free.
I'd actually go for the following (the patch below compiles but is
otherwise untested; it illustrates what I'm talking about):
The basic idea is to move the logic into page_cache_get. The rules
would be:
1. if you bring the page count to zero (using put_page_testzero) it
is your job to call __free_pages_ok eventually. Before doing so
make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing reference,
i.e. page_cache_get could be made to BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this
is only allowed to happen if we found the pointer on an lru list)
call page_cache_get_lru UNDER the lru lock and just leave the page
alone if that would resurrect the page. page_cache_get_lru would
basically look like this (implementation details below):
	int page_cache_get_lru(struct page *page)
	{
		if (!atomic_inc_and_test_for_one(&page->count))
			return 1;
		atomic_dec(&page->count);
		return 0;
	}
Proof of correctness:
A page is called dead if its page count reached zero before (no matter
what the page count currently is). Once a page is dead there can be
at most two pointers to the page: One held by the lru and the other
one held by the thread freeing the page. Any thread accessing the page
via the lru list will first call page_cache_get_lru under the lru lock,
the thread freeing the page will not read the page count anymore.
As page_cache_get_lru will not resurrect the page there will never
be a page count != 0 visible outside the lru lock on a dead page.
This means that each thread trying to access the dead page via the lru
list will detect that the page is dead and leave it alone. It follows
that each page is freed at most once.
Suppose a page could be leaked under these rules. This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a not already dead page. However, the only place
where this happens is in page_cache_get_lru and it only happens if the
page is already dead.
Now let's look at the ugly part: implementation.
The basic problem is that we don't have an atomic_inc_and_test_for_one
function and it is unlikely that we'll get one on all architectures. The
solution (and this is the ugly part) is a change in the semantics of
page->count. The value -1 now means no reference, 0 means one reference etc.
This way we can define
put_page_testzero as atomic_add_negative (-1, &page->count); and
get_page_testzero as atomic_inc_and_test (&page->count);
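Spelled out, the offset-by-one counting works like this (a worked
example of the scheme above, nothing more):

	real atomic value -1  <=>  page_count() == 0	(dead page)
	real atomic value  0  <=>  page_count() == 1	(one reference)

	put_page_testzero(): atomic_add_negative(-1, ...) returns true
	when the counter goes negative, i.e. when the last reference
	was just dropped.

	get_page_testzero(): atomic_inc_and_test() returns true when
	the counter becomes zero, i.e. when we just took a dead page
	to "one reference" - the resurrection that page_cache_get_lru
	must undo.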
Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:
diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
/*
* Methods to modify the page usage count.
*
+ * NOTE: Real page counts start at -1 for no reference. This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions. The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
* What counts for a page usage:
* - cache mapping (page->mapping)
* - private data (page->private)
@@ -192,12 +197,25 @@
*
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special Problem is the lru lists. Presence on one of these lists
+ * does not increase the page count. The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok). This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page i.e. the real
+ * page->count is now zero indicating one user. In this case we drop
+ * the reference again using __put_page. Both calls must happen under
+ * the lru lock.
*/
#define get_page(p) atomic_inc(&(p)->count)
#define __put_page(p) atomic_dec(&(p)->count)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)
extern void FASTCALL(__page_cache_release(struct page *));
#define put_page(p) \
do { \
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
-#define page_cache_get(x) get_page(x)
+/*
+ * Get a reference to the page. This function must not be called on
+ * a dead page, i.e. a page that has page count zero. If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+ BUG_ON(page_count(page) == 0);
+ get_page(page);
+}
+
+/*
+ * Try to get a reference to page that we found on an lru list.
+ * The lru lists may contain pages with page count zero. We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is of the lru lists). If we'd take a reference
+ * the page would eventually be freed twice.
+ *
+ * The return value is true if we successfully incremented the page count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+ /*
+ * Yes there is a window where the page count is not zero
+ * even though the page is dead. This is one of the reasons
+ * why the caller must hold the lru lock. Due to the lru_lock
+ * only the thread that is about to free the page can have
+ * a reference to this page. This thread will not test the
+ * page count anymore.
+ */
+ if (!get_page_testzero (page))
+ return 1;
+ __put_page (page);
+ return 0;
+}
static inline void page_cache_release(struct page *page)
{
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
void __page_cache_release(struct page *page)
{
unsigned long flags;
- BUG_ON(page_count(page) != 0);
spin_lock_irqsave(&_pagemap_lru_lock, flags);
if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) != 0)
- page = NULL;
spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
- if (page)
- __free_pages_ok(page, 0);
+ __free_pages_ok(page, 0);
}
/*
@@ -131,8 +127,7 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) == 0)
- pagevec_add(&pages_to_free, page);
+ pagevec_add(&pages_to_free, page);
}
if (lock_held)
spin_unlock_irq(&_pagemap_lru_lock);
diff -ur linux-2.5.31-bk7/mm/vmscan.c linux-2.5.31-cae/mm/vmscan.c
--- linux-2.5.31-bk7/mm/vmscan.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/vmscan.c Sun Aug 25 21:44:07 2002
@@ -92,6 +92,10 @@
return page_count(page) - !!PagePrivate(page) == 2;
}
+/*
+ * The caller must hold a reference to each page in the list. We drop
+ * this reference if and only if we remove the page from the page_list.
+ */
static /* inline */ int
shrink_list(struct list_head *page_list, int nr_pages, zone_t *classzone,
unsigned int gfp_mask, int priority, int *max_scan)
@@ -295,24 +299,23 @@
spin_lock_irq(&_pagemap_lru_lock);
while (max_scan > 0 && nr_pages > 0) {
struct page *page;
+ struct list_head * curr;
int n = 0;
- while (n < nr_to_process && !list_empty(&inactive_list)) {
- page = list_entry(inactive_list.prev, struct page, lru);
+ curr = inactive_list.prev;
+ while (n < nr_to_process && (&inactive_list != curr)) {
+ page = list_entry(curr, struct page, lru);
- prefetchw_prev_lru_page(page, &inactive_list, flags);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;
+ /* Is the page already dead ? */
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
list_del(&page->lru);
- if (page_count(page) == 0) {
- /* It is currently in pagevec_release() */
- SetPageLRU(page);
- list_add(&page->lru, &inactive_list);
- continue;
- }
list_add(&page->lru, &page_list);
- page_cache_get(page);
n++;
}
spin_unlock_irq(&_pagemap_lru_lock);
@@ -381,15 +384,19 @@
LIST_HEAD(l_active); /* Pages to go onto the active_list */
struct page *page;
struct pagevec pvec;
+ struct list_head *curr;
lru_add_drain();
spin_lock_irq(&_pagemap_lru_lock);
- while (nr_pages && !list_empty(&active_list)) {
- page = list_entry(active_list.prev, struct page, lru);
- prefetchw_prev_lru_page(page, &active_list, flags);
+ curr = active_list.prev;
+ while (nr_pages && (&active_list != curr)) {
+ page = list_entry(curr, struct page, lru);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
- page_cache_get(page);
list_move(&page->lru, &l_hold);
nr_pages--;
}
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 14:22 UTC
To: Christian Ehrhardt, Andrew Morton; +Cc: lkml, linux-mm
On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> + * A special Problem is the lru lists. Presence on one of these lists
> + * does not increase the page count.
Please remind me... why should it not?
--
Daniel
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 15:29 UTC
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > + * A special Problem is the lru lists. Presence on one of these lists
> > + * does not increase the page count.
>
> Please remind me... why should it not?
Pages that are only on the lru but not referenced by anyone are of no
use and we want to free them immediately. If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru. One could argue that
try_to_free_pages could do this but try_to_free_pages will process the
pages in lru order and push out other pages first.
The next suggestion that comes to mind is: Let's have some magic in
page_cache_release that will remove the page from the lru list if
it is actually dead. However, this raises the question: How do we detect
that a page is now dead? The answer is something along the lines of
	if (put_page_testzero(page)) {
		__free_pages_ok(page, 0);
		return;
	}
	spin_lock_irq(&pagemap_lru_lock);
	if (PageLRU(page) && (page_count(page) == 1)) {
		lru_cache_del(page);
		spin_unlock_irq(&pagemap_lru_lock);
		page_cache_release(page);
		return;
	}
	spin_unlock_irq(&pagemap_lru_lock);
	return;
The sad truth is that this solution has all the same races that
we have now, plus it makes the fast path (decreasing the page count
to something not zero) slower. One problem in the above would be
that the last reference might well not be due to the lru
cache, i.e. at the time we call PageLRU(page) the page might
have been freed by another processor.
I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.
regards Christian Ehrhardt
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 17:56 UTC
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > + * A special Problem is the lru lists. Presence on one of these lists
> > > + * does not increase the page count.
> >
> > Please remind me... why should it not?
>
> Pages that are only on the lru but not referenced by anyone are of no
> use and we want to free them immediately. If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.
I don't understand this argument. Suppose lru list membership is worth a
page count of one. Then anyone who finds a page by way of the lru list can
safely put_page_testzero and remove the page from the lru list. Anyone who
finds a page by way of a page table can likewise put_page_testzero and clear
the pte, or remove the mapping and pass the page to Andrew's pagevec
machinery, which will eventually do the put_page_testzero. Anyone who
removes a page from a radix tree will also do a put_page_testzero. Exactly
one of those paths will result in the page count reaching zero, which tells
us nobody else holds a reference and it's time for __free_pages_ok. The page
is thus freed immediately as soon as there are no more references to it, and
does not hang around on the lru list.
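In code, the scheme I have in mind is roughly this (a sketch with
made-up helper names, not a patch):

	/* Adding a page to the LRU takes a reference on the list's
	 * own behalf... */
	void lru_cache_add(struct page *page)
	{
		page_cache_get(page);
		spin_lock_irq(&_pagemap_lru_lock);
		add_page_to_inactive_list(page);
		spin_unlock_irq(&_pagemap_lru_lock);
	}

	/* ...and removing it drops that reference, possibly the last
	 * one. The count can only reach zero once the page is off
	 * the list, so nobody can ever find a zero-count page via
	 * the LRU. */
	void lru_cache_del(struct page *page)
	{
		spin_lock_irq(&_pagemap_lru_lock);
		if (TestClearPageLRU(page))
			del_page_from_inactive_list(page);
		spin_unlock_irq(&_pagemap_lru_lock);
		page_cache_release(page);
	}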
Nobody has to lock against the page count. Each put_page_testzero caller
only locks the data structure from which it's removing the reference.
This seems so simple, what is the flaw?
--
Daniel
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 19:24 UTC
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can
> safely put_page_testzero and remove the page from the lru list. Anyone who
> finds a page by way of a page table can likewise put_page_testzero and clear
> the pte, or remove the mapping and pass the page to Andrew's pagevec
> machinery, which will eventually do the put_page_testzero. Anyone who
> removes a page from a radix tree will also do a put_page_testzero. Exactly
> one of those paths will result in the page count reaching zero, which tells
> us nobody else holds a reference and it's time for __free_pages_ok. The page
> is thus freed immediately as soon as there are no more references to it, and
> does not hang around on the lru list.
>
> Nobody has to lock against the page count. Each put_page_testzero caller
> only locks the data structure from which it's removing the reference.
>
> This seems so simple, what is the flaw?
The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the zero-refcount
page.
CPUA:

	if (put_page_testzero()) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(page->lru);

CPUB:

	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);		/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */
2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.
It's that `put' outside the lock which is the culprit. Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).
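The inode-style variant here would look something like this (a sketch
only; among other things it ignores that the lru lock is taken
irq-safe elsewhere):

	page_cache_release(page)
	{
		/* Takes the lock only when the count is about to
		 * reach zero; the common case stays lockless. */
		if (atomic_dec_and_lock(&page->count, &lru_lock)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			spin_unlock(&lru_lock);
			__free_pages_ok(page, 0);
		}
	}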
The sane, sensible and sucky way is to always take the lock:

	page_cache_release(page)
	{
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			lru_cache_del(page);
			__free_pages_ok(page, 0);
		}
		spin_unlock(lru_lock);
	}
Because this provides exclusion from another CPU discovering the page
via the LRU.
So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()? Maybe:
	page_cache_release(page)
	{
		if (page_count(page) == 1) {
			spin_lock(lru_lock);
			if (put_page_testzero(page)) {
				if (PageLRU(page))
					__lru_cache_del(page);
				__free_pages_ok(page, 0);
			}
			spin_unlock(lru_lock);
		} else {
			atomic_dec(&page->count);
		}
	}
This is nice and quick, but racy. Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU. These are
rare, and can be mopped up in page reclaim.
The above code will also work for pages which aren't on the LRU. It will
take the lock unnecessarily for (say) slab pages. But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.
[1] The race requires that the CPU running page_cache_release find a
five instruction window against the CPU running shrink_cache. And
that they be operating against the same page. And that the CPU
running __page_cache_release() then take an interrupt in a 3-4
instruction window. And that the interrupt take longer than the
runtime for shrink_list. And that the page be the first page in
the pagevec.
It's a heat-death-of-the-universe-race, but even if it were to be
ignored, the current code is too complex.
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 19:34 UTC
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Monday 26 August 2002 21:24, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
> > safely put_page_testzero and remove the page from the lru list. Anyone who
> > finds a page by way of a page table can likewise put_page_testzero and clear
> > the pte, or remove the mapping and pass the page to Andrew's pagevec
> > machinery, which will eventually do the put_page_testzero. Anyone who
> > removes a page from a radix tree will also do a put_page_testzero. Exactly
> > one of those paths will result in the page count reaching zero, which tells
> > us nobody else holds a reference and it's time for __free_pages_ok. The page
> > is thus freed immediately as soon as there are no more references to it, and
> > does not hang around on the lru list.
> >
> > Nobody has to lock against the page count. Each put_page_testzero caller
> > only locks the data structure from which it's removing the reference.
> >
> > This seems so simple, what is the flaw?
>
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-refcount
> page.
>
> CPUA:
> if (put_page_testzero()) {
> /* Here's the window */
> spin_lock(lru_lock);
> list_del(page->lru);
According to my assumption that lru list membership is (should be) worth one
page count, if testzero triggers here the page is not on the lru.
> CPUB:
>
> spin_lock(lru_lock);
> page = list_entry(lru);
> page_cache_get(page); /* If this goes from 0->1, we die */
It can't. You know that because you found the page on the lru, its count
must be at least one (again, according to assumption above).
> ...
> page_cache_release(page); /* double free */
I'd like to jump in and chase more solutions with you, but the above doesn't
prove your point, so I'm not ready to reject this one yet.
--
Daniel
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 19:48 UTC
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm
On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
Well, one could argue that doing the put_page_testzero outside of any
locking is a feature.
> [ ... ]
>
> 2.5.31-mm1 has tests which make this race enormously improbable [1],
> but it's still there.
Agreed. Both on the improbable and on the still there part.
> It's that `put' outside the lock which is the culprit. Normally, we
> handle that with atomic_dec_and_lock() (inodes) or by manipulating
> the refcount inside an area which has exclusion (page presence in
> pagecache).
>
> The sane, sensible and sucky way is to always take the lock:
>
> page_cache_release(page)
> {
> spin_lock(lru_lock);
> if (put_page_testzero(page)) {
> lru_cache_del(page);
> __free_pages_ok(page, 0);
> }
> spin_unlock(lru_lock);
> }
That would probably solve the problem.
> Because this provides exclusion from another CPU discovering the page
> via the LRU.
>
> So taking the above as the design principle, how can we speed it up?
> How to avoid taking the lock in every page_cache_release()? Maybe:
>
> page_cache_release(page)
> {
> if (page_count(page) == 1) {
> spin_lock(lru_lock);
> if (put_page_testzero(page)) {
> if (PageLRU(page))
> __lru_cache_del(page);
> __free_pages_ok(page);
> }
> spin_unlock(lru_lock);
> } else {
> atomic_dec(&page->count);
> }
> }
However, this is an incredibly bad idea if the page is NOT on the lru.
Think of two instances of page_cache_release racing against each other.
This could result in a leaked page which is not on the LRU.
> This is nice and quick, but racy. Two concurrent page_cache_releases
> will create a zero-ref unfreed page which is on the LRU. These are
> rare, and can be mopped up in page reclaim.
>
> The above code will also work for pages which aren't on the LRU. It will
> take the lock unnecessarily for (say) slab pages. But if we put slab pages
> on the LRU then I suspect there are so few non-LRU pages left that it isn't
> worth bothering about this.
No it will not work. See above.
> [1] The race requires that the CPU running page_cache_release find a
> five instruction window against the CPU running shrink_cache. And
> that they be operating against the same page. And that the CPU
> running __page_cache_release() then take an interrupt in a 3-4
> instruction window. And that the interrupt take longer than the
> runtime for shrink_list. And that the page be the first page in
> the pagevec.
The interrupt can also be a preemption which might easily take long
enough. But I agree that the race is now rare. The real problem is
that the locking rules don't guarantee that there are no other racy
paths that we both missed.
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 20:00 UTC
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can
This does fix the double free problem but think of a typical anonymous
page at exit. The page is on the lru list and there is one reference held
by the pte. According to your scheme the pte reference would be freed
(obviously due to the exit) but the page would remain on the lru list.
However, there is no point in leaving the page on the lru list at all.
If you think about who is going to remove the page from the lru you'll
see the problem.
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Daniel Phillips @ 2002-08-26 20:09 UTC
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
>
> This does fix the double free problem but think of a typical anonymous
> page at exit. The page is on the lru list and there is one reference held
> by the pte. According to your scheme the pte reference would be freed
> (obviously due to the exit) but the page would remain on the lru list.
> However, there is no point in leaving the page on the lru list at all.
If you want the page off the lru list at that point (which you probably do)
then you take the lru lock and put_page_testzero.
> If you think about who is going to remove the page from the lru you'll
> see the problem.
Nope, still don't see it. Whoever hits put_page_testzero frees the page,
secure in the knowledge that there are no other references to it.
--
Daniel
* Re: MM patches against 2.5.31
From: Christian Ehrhardt @ 2002-08-26 20:58 UTC
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > + * does not increase the page count.
> > > > >
> > > > > Please remind me... why should it not?
> > > >
> > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > use and we want to free them immediately. If we leave them on the lru
> > > > list with a page count of 1, someone else will have to walk the lru
> > > > list and remove pages that are only on the lru.
> > >
> > > I don't understand this argument. Suppose lru list membership is worth a
> > > page count of one. Then anyone who finds a page by way of the lru list can
> >
> > This does fix the double free problem but think of a typical anonymous
> > page at exit. The page is on the lru list and there is one reference held
> > by the pte. According to your scheme the pte reference would be freed
> > (obviously due to the exit) but the page would remain on the lru list.
> > However, there is no point in leaving the page on the lru list at all.
>
> If you want the page off the lru list at that point (which you probably do)
> then you take the lru lock and put_page_testzero.
Could you clarify what you mean by "at that point"? Especially, how
do you plan to test for "that point"? Besides, it is illegal to use
the page after put_page_testzero (unless put_page_testzero returns true).
> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowlege that there are no other references to it.
Well yes, but we cannot remove the page from the lru atomically
at page_cache_release time if we follow your proposal. If you think we can,
show me your implementation of page_cache_release and I'll show
you where the races are (unless you do everything under the lru_lock
of course).
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
From: Andrew Morton @ 2002-08-26 21:31 UTC
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> ...
> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowledge that there are no other references to it.
Sure. But this requires that the caller of page_cache_release() has
previously removed the page from the LRU. We (used to) do that for truncate
and page reclaim. But we did not do that for anon pages.
For anon pages, we perform magical LRU removal when the page refcount
goes to zero.
The fact that we performed explicit removal in one place, and magical removal
in the other was unfortunate. I nuked the explicit removal and made it
all magical (explicit removal in truncate_complete_page() was wrong anyway - the
page could have been rescued and anonymised by concurrent pagefault and must
stay on the LRU).
Possibly, we could go back to explicit removal everywhere. Haven't
really looked at that, but I suspect we're back to a similar problem:
how do you unracily determine whether the page should be removed from
the LRU? Take ->page_table_lock and look at page_count(page)? Worried.
I like the magical-removal-just-before-free, and my gut feel is that
it'll provide a cleaner end result.
Making presence on the LRU contribute to page->count is attractive,
if only because it removes some irritating and expensive page_cache_gets
and puts from shrink_cache and refill_inactive. But for it to be useful,
we must perform explicit removal everywhere.
Making presence on the LRU contribute to page->count doesn't fundamentally
change anything of course - it offsets the current problems by one.
Then again, it would remove all page_cache_gets/releases from vmscan.c
and may thus make the race go away. That's a bit of a timebomb though.
* Re: MM patches against 2.5.31
From: Benjamin LaHaise @ 2002-08-27 3:42 UTC
To: Andrew Morton; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm
On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> I like the magical-removal-just-before-free, and my gut feel is that
> it'll provide a cleaner end result.
For the record, I'd rather see explicit removal everywhere. We received
a number of complaints along the lines of "I run my app immediately after
system startup, and it's fast, but the second time it's slower" due to
the lazy page reclaim in early 2.4. Until there's a way to make LRU
scanning faster than page allocation, it can't be lazy.
-ben
* Re: MM patches against 2.5.31
2002-08-27 3:42 ` Benjamin LaHaise
@ 2002-08-27 4:37 ` Andrew Morton
0 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2002-08-27 4:37 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm
Benjamin LaHaise wrote:
>
> On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> > I like the magical-removal-just-before-free, and my gut feel is that
> > it'll provide a cleaner end result.
>
> For the record, I'd rather see explicit removal everywhere. We received
> a number of complaints along the lines of "I run my app immediately after
> system startup, and it's fast, but the second time it's slower" due to
> the lazy page reclaim in early 2.4. Until there's a way to make LRU
> scanning faster than page allocation, it can't be lazy.
>
I think that's what Rik was referring to.
But here, "explicit removal" refers to running lru_cache_del() prior
to the final put_page, rather than within the context of the final
put_page. So it's a different thing.
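A sketch to illustrate the distinction (not code from any patch):

	/* Explicit removal: the caller pulls the page off the LRU before
	 * dropping its final reference. */
	lru_cache_del(page);
	page_cache_release(page);	/* final put; page already off the LRU */

	/* Magical removal: the final put itself sees the count hit zero
	 * and does the LRU removal on the way to freeing the page. */
	page_cache_release(page);	/* put_page_testzero() -> LRU del -> free */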
* Re: MM patches against 2.5.31
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
@ 2002-08-27 9:22 ` Christian Ehrhardt
2002-08-27 19:19 ` Andrew Morton
2 siblings, 1 reply; 37+ messages in thread
From: Christian Ehrhardt @ 2002-08-27 9:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm
On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-refcount
> page.
>
> CPUA:
> if (put_page_testzero()) {
> /* Here's the window */
> spin_lock(lru_lock);
> list_del(page->lru);
>
> CPUB:
>
> spin_lock(lru_lock);
> page = list_entry(lru);
> page_cache_get(page); /* If this goes from 0->1, we die */
> ...
> page_cache_release(page); /* double free */
So what we want CPU B to do instead is:

	spin_lock(lru_lock);
	page = list_entry(lru);

	START ATOMIC
	page_cache_get(page);
	res = (page_count(page) == 1);
	END ATOMIC

	if (res) {
		atomic_dec(&page->count);
		continue;	/* with next page */
	}
	...
	page_cache_release(page);
I.e. we want to detect _atomically_ that we just raised the page count
from zero to one. My patch actually has a solution that implements the
needed atomic operation above by means of the atomic functions that we
currently have on all archs (it's called get_page_testzero and
should probably be called get_page_testone).
The more I think about this the more I think this is the way to go.
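A sketch of how the atomic section above collapses to a single primitive,
assuming the count is stored offset by -1 as in the patch (the real macros
are quoted in the reply below); get_page_testone is the suggested rename,
not an existing kernel symbol:

	/* A stored value of -1 means "no references", 0 means one.
	 * Incrementing the stored value and testing for zero therefore
	 * detects, in one atomic operation, that we just raised the
	 * logical count from 0 to 1. */
	#define get_page_testone(p)	atomic_inc_and_test(&(p)->count)

	spin_lock(lru_lock);
	page = list_entry(lru, struct page, lru);
	if (get_page_testone(page)) {
		/* We resurrected a zero-count page which another CPU is
		 * already committed to freeing: undo the get, skip it. */
		atomic_dec(&page->count);
		continue;	/* with next page */
	}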
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
2002-08-26 20:58 ` Christian Ehrhardt
@ 2002-08-27 16:48 ` Daniel Phillips
2002-08-28 13:14 ` Christian Ehrhardt
0 siblings, 1 reply; 37+ messages in thread
From: Daniel Phillips @ 2002-08-27 16:48 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > > + * does not increase the page count.
> > > > > >
> > > > > > Please remind me... why should it not?
> > > > >
> > > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > > use and we want to free them immediately. If we leave them on the lru
> > > > > list with a page count of 1, someone else will have to walk the lru
> > > > > list and remove pages that are only on the lru.
> > > >
> > > > I don't understand this argument. Suppose lru list membership is worth a
> > > > page count of one. Then anyone who finds a page by way of the lru list can
> > >
> > > This does fix the double free problem but think of a typical anonymous
> > > page at exit. The page is on the lru list and there is one reference held
> > > by the pte. According to your scheme the pte reference would be freed
> > > (obviously due to the exit) but the page would remain on the lru list.
> > > However, there is no point in leaving the page on the lru list at all.
> >
> > If you want the page off the lru list at that point (which you probably do)
> > then you take the lru lock and put_page_testzero.
>
> Could you clarify what you mean by "at that point"? Especially, how
> do you plan to test for "that point"? Besides, it is illegal to use
> the page after put_page_testzero (unless put_page_testzero returns true).
> > > If you think about who is going to remove the page from the lru you'll
> > > see the problem.
> >
> > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > secure in the knowledge that there are no other references to it.
>
> Well yes, but we cannot remove the page from the lru automatically
> at page_cache_release time if we follow your proposal. If you think we can,
> show me your implementation of page_cache_release and I'll show
> you where the races are (unless you do everything under the lru_lock
> of course).
void page_cache_release(struct page *page)
{
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2) {
		__lru_cache_del(page);
		atomic_dec(&page->count);
	}
	spin_unlock(&pagemap_lru_lock);
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}
This allows the following benign race, with initial page count = 3:
	/* CPU A */
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false: count is 3 */
	spin_unlock(&pagemap_lru_lock);

	/* CPU B */
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false: count still 3 */
	spin_unlock(&pagemap_lru_lock);

	/* CPU A */
	if (put_page_testzero(page))	/* count 3 -> 2: no free */
		__free_pages_ok(page, 0);

	/* CPU B */
	if (put_page_testzero(page))	/* count 2 -> 1: no free */
		__free_pages_ok(page, 0);
Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1. This won't happen often and such
pages will be recovered from the cold end of the list in due course.
The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page? In
other words, can the test PageLRU(page) && page_count(page) == 2 return
a false positive? Well, when this test is true we can account for both
references: the one we own, and the one the lru list owns. Since
we hold the lru lock, the latter won't change. Nobody else has the
right to increment the page count, since they must inherit that right
from somebody who holds a reference, and there are none.
We could also do this:
void page_cache_release(struct page *page)
{
	if (page_count(page) == 2) {
		spin_lock(&pagemap_lru_lock);
		if (PageLRU(page) && page_count(page) == 2) {
			__lru_cache_del(page);
			atomic_dec(&page->count);
		}
		spin_unlock(&pagemap_lru_lock);
	}
	if (put_page_testzero(page))
		__free_pages_ok(page, 0);
}
Which avoids taking the lru lock sometimes in exchange for widening the
hole through which pages can end up with count = 1 on the lru list.
Let's run this through your race detector and see what happens.
--
Daniel
* Re: MM patches against 2.5.31
2002-08-27 9:22 ` Christian Ehrhardt
@ 2002-08-27 19:19 ` Andrew Morton
0 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2002-08-27 19:19 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Daniel Phillips, lkml, linux-mm
Christian Ehrhardt wrote:
>
> ...
> So what we want CPU B to do instead is
>
> spin_lock(lru_lock);
> page = list_entry(lru)
>
> START ATOMIC
> page_cache_get(page);
> res = (page_count (page) == 1)
> END ATOMIC
>
> if (res) {
> atomic_dec (&page->count);
> continue; /* with next page */
> }
> ...
> page_cache_release (page);
>
> I.e. we want to detect _atomically_ that we just raised the page count
> from zero to one. My patch actually has a solution that implements the
> needed atomic operation above by means of the atomic functions that we
> currently have on all archs (it's called get_page_testzero and
> should probably be called get_page_testone).
> The more I think about this the more I think this is the way to go.
>
Yes, I think that would provide a minimal fix to the problem.
(I'd prefer a solution in which presence on the LRU contributes
to page->count, because that means I can dump a load of expensive
page_cache_get-inside-lru-lock instances, but whatever)
You had:
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)
So the page count is actually offset by -1, and that is hidden by
the macros. Fair enough.
atomic_add_negative() is not implemented on quite a number of
architectures (sparc64, mips, ppc, sh, cris, 68k, alpha..), so
some legwork is needed there. Looks to be pretty simple though;
alpha, ppc and others already have atomic_add_return().
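Where atomic_add_return() already exists, the missing primitive looks like
a one-liner; a sketch, not a tested patch for any particular architecture:

	/* The sign of the post-add value is all put_page_testzero() needs:
	 * with the count stored offset by -1, going negative on the
	 * decrement means the logical count just reached zero. */
	#define atomic_add_negative(i, v)	(atomic_add_return((i), (v)) < 0)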
* Re: MM patches against 2.5.31
2002-08-27 16:48 ` Daniel Phillips
@ 2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
2002-08-28 20:41 ` Daniel Phillips
0 siblings, 2 replies; 37+ messages in thread
From: Christian Ehrhardt @ 2002-08-28 13:14 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Tue, Aug 27, 2002 at 06:48:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> > > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > > secure in the knowledge that there are no other references to it.
> >
> > Well yes, but we cannot remove the page from the lru automatically
> > at page_cache_release time if we follow your proposal. If you think we can,
> > show me your implementation of page_cache_release and I'll show
> > you where the races are (unless you do everything under the lru_lock
> > of course).
>
> void page_cache_release(struct page *page)
> {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> This allows the following benign race, with initial page count = 3:
> [ ...]
> Neither holder of a page reference sees the count at 2, and so the page
> is left on the lru with count = 1. This won't happen often and such
> pages will be recovered from the cold end of the list in due course.
Ok, agreed. I think this will work but taking the lru lock each time
is probably not a good idea.
> We could also do this:
>
> void page_cache_release(struct page *page)
> {
> if (page_count(page) == 2) {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> }
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> Which avoids taking the lru lock sometimes in exchange for widening the
> hole through which pages can end up with count = 1 on the lru list.
This sounds like something that is worth trying. I missed that one.
Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
or shrink_cache could have removed the page from the lru before
__pagevec_lru_del acquired the lru lock.
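If the function were to stay, a sketch of a tolerant variant (the reply
below takes the simpler route of deleting the function instead):

	/* TestClearPageLRU() reports atomically whether the LRU removal
	 * is still ours to do; skip pages someone else already took off. */
	if (TestClearPageLRU(page))
		del_page_from_lru(zone, page);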
regards Christian
--
THAT'S ALL FOLKS!
* Re: MM patches against 2.5.31
2002-08-28 13:14 ` Christian Ehrhardt
@ 2002-08-28 17:18 ` Daniel Phillips
2002-08-28 17:42 ` Andrew Morton
2002-08-28 20:41 ` Daniel Phillips
1 sibling, 1 reply; 37+ messages in thread
From: Daniel Phillips @ 2002-08-28 17:18 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> or shrink_cache could have removed the page from the lru before
> __pagevec_lru_del acquired the lru lock.
It's suspect all right. If there's a chain of assumptions that proves
the page is always on the lru at that point, I haven't seen it yet.
--
Daniel
* Re: MM patches against 2.5.31
2002-08-28 17:18 ` Daniel Phillips
@ 2002-08-28 17:42 ` Andrew Morton
0 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2002-08-28 17:42 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> > Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> > or shrink_cache could have removed the page from the lru before
> > __pagevec_lru_del acquired the lru lock.
>
> It's suspect all right. If there's a chain of assumptions that proves
> the page is always on the lru at the point, I haven't seen it yet.
Yeah. __pagevec_lru_del is only used by invalidate_inode_pages.
A very simple solution is to just delete it.
untested code:
include/linux/pagevec.h | 7 -------
mm/filemap.c | 10 +++++-----
mm/swap.c | 28 ----------------------------
3 files changed, 5 insertions(+), 40 deletions(-)
--- 2.5.32/mm/filemap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/filemap.c Wed Aug 28 09:51:51 2002
@@ -116,10 +116,10 @@ void invalidate_inode_pages(struct inode
struct list_head *head, *curr;
struct page * page;
struct address_space *mapping = inode->i_mapping;
- struct pagevec lru_pvec;
+ struct pagevec pvec;
head = &mapping->clean_pages;
- pagevec_init(&lru_pvec);
+ pagevec_init(&pvec);
write_lock(&mapping->page_lock);
curr = head->next;
@@ -143,8 +143,8 @@ void invalidate_inode_pages(struct inode
__remove_from_page_cache(page);
unlock_page(page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_del(&lru_pvec);
+ if (!pagevec_add(&pvec, page))
+ __pagevec_release(&pvec);
continue;
unlock:
unlock_page(page);
@@ -152,7 +152,7 @@ unlock:
}
write_unlock(&mapping->page_lock);
- pagevec_lru_del(&lru_pvec);
+ pagevec_release(&pvec);
}
static int do_invalidatepage(struct page *page, unsigned long offset)
--- 2.5.32/include/linux/pagevec.h~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/include/linux/pagevec.h Wed Aug 28 09:51:51 2002
@@ -18,7 +18,6 @@ void __pagevec_release(struct pagevec *p
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_del(struct pagevec *pvec);
void lru_add_drain(void);
void pagevec_deactivate_inactive(struct pagevec *pvec);
@@ -69,9 +68,3 @@ static inline void pagevec_lru_add(struc
if (pagevec_count(pvec))
__pagevec_lru_add(pvec);
}
-
-static inline void pagevec_lru_del(struct pagevec *pvec)
-{
- if (pagevec_count(pvec))
- __pagevec_lru_del(pvec);
-}
--- 2.5.32/mm/swap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/swap.c Wed Aug 28 09:51:58 2002
@@ -214,34 +214,6 @@ void __pagevec_lru_add(struct pagevec *p
}
/*
- * Remove the passed pages from the LRU, then drop the caller's refcount on
- * them. Reinitialises the caller's pagevec.
- */
-void __pagevec_lru_del(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (!TestClearPageLRU(page))
- BUG();
- del_page_from_lru(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- pagevec_release(pvec);
-}
-
-/*
* Perform any setup for the swap system
*/
void __init swap_setup(void)
.
* Re: MM patches against 2.5.31
2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
@ 2002-08-28 20:41 ` Daniel Phillips
2002-08-28 21:03 ` Andrew Morton
1 sibling, 1 reply; 37+ messages in thread
From: Daniel Phillips @ 2002-08-28 20:41 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
Going right back to basics, what do you suppose is wrong with the 2.4
strategy of always doing the lru removal in free_pages_ok?
--
Daniel
* Re: MM patches against 2.5.31
2002-08-28 20:41 ` Daniel Phillips
@ 2002-08-28 21:03 ` Andrew Morton
2002-08-28 22:04 ` Daniel Phillips
0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2002-08-28 21:03 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> Going right back to basics, what do you suppose is wrong with the 2.4
> strategy of always doing the lru removal in free_pages_ok?
That's equivalent to what we have at present, which is:
	if (put_page_testzero(page)) {
		/* window here */
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}

versus:

	spin_lock(lru_lock);
	page = list_entry(lru, ...);
	if (page_count(page) == 0)
		continue;
	/* window here */
	page_cache_get(page);
	page_cache_release(page);	/* double-free */
* Re: MM patches against 2.5.31
2002-08-28 21:03 ` Andrew Morton
@ 2002-08-28 22:04 ` Daniel Phillips
2002-08-28 22:39 ` Andrew Morton
0 siblings, 1 reply; 37+ messages in thread
From: Daniel Phillips @ 2002-08-28 22:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Wednesday 28 August 2002 23:03, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Going right back to basics, what do you suppose is wrong with the 2.4
> > strategy of always doing the lru removal in free_pages_ok?
>
> That's equivalent to what we have at present, which is:
>
> if (put_page_testzero(page)) {
> /* window here */
> lru_cache_del(page);
> __free_pages_ok(page, 0);
> }
>
> versus:
>
> spin_lock(lru lock);
> page = list_entry(lru, ...);
> if (page_count(page) == 0)
> continue;
> /* window here */
> page_cache_get(page);
> page_cache_release(page); /* double-free */
Indeed it is. In 2.4.19 we have:
(vmscan.c: shrink_cache)                   (page_alloc.c: __free_pages)

365  if (unlikely(!page_count(page)))
366          continue;
                                           444  if (!PageReserved(page) && put_page_testzero(page))
     [many twisty paths, all different]
511  /* effectively free the page here */
512  page_cache_release(page);
                                           445      __free_pages_ok(page, order);
                                           [free it again just to make sure]
So there's no question that the race is lurking in 2.4. I noticed several
more paths besides the one above that look suspicious as well. The bottom
line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
something that can actually be proved.
It's a wonder that this problem manifests so rarely in practice.
--
Daniel
* Re: MM patches against 2.5.31
2002-08-28 22:04 ` Daniel Phillips
@ 2002-08-28 22:39 ` Andrew Morton
2002-08-28 22:57 ` Daniel Phillips
0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2002-08-28 22:39 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> ...
> So there's no question that the race is lurking in 2.4. I noticed several
> more paths besides the one above that look suspicious as well. The bottom
> line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> something that can actually be proved.
>
> It's a wonder that this problem manifests so rarely in practice.
I sort-of glanced through the 2.4 paths and it appears that in all of the
places where it could do a page_cache_get/release, that would never happen
because of other parts of the page state.
Like: it can't be in pagecache, so we won't run writepage, and
it can't have buffers, so we won't run try_to_release_page().
Of course, I might have missed a path. And, well, generally: ugh.
* Re: MM patches against 2.5.31
2002-08-28 22:39 ` Andrew Morton
@ 2002-08-28 22:57 ` Daniel Phillips
0 siblings, 0 replies; 37+ messages in thread
From: Daniel Phillips @ 2002-08-28 22:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Thursday 29 August 2002 00:39, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > So there's no question that the race is lurking in 2.4. I noticed several
> > more paths besides the one above that look suspicious as well. The bottom
> > line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> > something that can actually be proved.
> >
> > It's a wonder that this problem manifests so rarely in practice.
>
> I sort-of glanced through the 2.4 paths and it appears that in all of the
> places where it could do a page_cache_get/release, that would never happen
> because of other parts of the page state.
>
> Like: it can't be in pagecache, so we won't run writepage, and
> it can't have buffers, so we won't run try_to_release_page().
>
> Of course, I might have missed a path. And, well, generally: ugh.
I think it is happening. I just went sifting through the archives
for 'oops' and '2.4'. The first one I found was:
2.4.18-xfs (xfs related?) oops report
which fits the description nicely.
The race I showed actually causes the page->count to go negative, avoiding
a double free on a technicality. That doesn't make me feel much better about
it. Have you got a BUG_ON(!page_count(page)) in put_page_testzero? I think
we might see some action.
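The check being suggested would look something like this (hypothetical,
against the stock non-offset macro; whether akpm's tree carries it is
exactly the question being asked):

	/* Catch a put on a page whose count is already zero, i.e. the
	 * double free this thread has been chasing. */
	#define put_page_testzero(p)				\
	({							\
		BUG_ON(!page_count(p));				\
		atomic_dec_and_test(&(p)->count);		\
	})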
--
Daniel
* Re: MM patches against 2.5.31
2002-08-26 23:58 ` Andrew Morton
@ 2002-08-27 0:13 ` Rik van Riel
0 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2002-08-27 0:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Ed Tomlinson, linux-mm, linux-kernel, Christian Ehrhardt,
Daniel Phillips
On Mon, 26 Aug 2002, Andrew Morton wrote:
> Well we wouldn't want to leave tons of free pages on the LRU - the VM
> would needlessly reclaim pagecache before finding the free pages. And
> higher-order page allocations could suffer.
We did this with the swap cache in 2.4 kernels before 2.4.7, and it
was an absolute disaster.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* Re: MM patches against 2.5.31
2002-08-26 22:09 Ed Tomlinson
@ 2002-08-26 23:58 ` Andrew Morton
2002-08-27 0:13 ` Rik van Riel
0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2002-08-26 23:58 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm, linux-kernel, Christian Ehrhardt, Daniel Phillips
Ed Tomlinson wrote:
>
> This seems to have been missed:
Still thinking about it.
> Linus Torvalds wrote:
>
> > In article <3D6989F7.9ED1948A@zip.com.au>,
> > Andrew Morton <akpm@zip.com.au> wrote:
> >>
> >>What I'm inclined to do there is to change __page_cache_release()
> >>to not attempt to free the page at all. Just let it sit on the
> >>LRU until page reclaim encounters it. With the anon-free-via-pagevec
> >>patch, very, very, very few pages actually get their final release in
> >>__page_cache_release() - zero on uniprocessor, I expect.
> >
> > If you do this, then I would personally suggest a conceptually different
> > approach: make the LRU list count towards the page count. That will
> > _automatically_ result in what you describe - if a page is on the LRU
> > list, then "freeing" it will always just decrement the count, and the
> > _real_ free comes from walking the LRU list and considering count==1 to
> > be trivially freeable.
> >
> > That way you don't have to have separate functions for releasing
> > different kinds of pages (we've seen how nasty that was from a
> > maintenance standpoint already with the "put_page vs
> > page_cache_release" thing).
> >
> > Ehh?
>
> If every structure locks before removing its reference (i.e. before testing
> and/or removing an lru reference we take zone->lru_lock, for slabs we take
> cachep->spinlock, etc.), it's a bit of an audit task to make sure the various
> locks are taken (and documented) though.
>
> By letting the actual free be lazy as Linus suggests, things should simplify nicely.
Well we wouldn't want to leave tons of free pages on the LRU - the
VM would needlessly reclaim pagecache before finding the free pages. And
higher-order page allocations could suffer.
If we go for explicit lru removal in truncate and zap_pte_range
then this approach may be best. Still thinking about it.
* Re: MM patches against 2.5.31
@ 2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
0 siblings, 1 reply; 37+ messages in thread
From: Ed Tomlinson @ 2002-08-26 22:09 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: Andrew Morton, Christian Ehrhardt, Daniel Phillips
This seems to have been missed:
Linus Torvalds wrote:
> In article <3D6989F7.9ED1948A@zip.com.au>,
> Andrew Morton <akpm@zip.com.au> wrote:
>>
>>What I'm inclined to do there is to change __page_cache_release()
>>to not attempt to free the page at all. Just let it sit on the
>>LRU until page reclaim encounters it. With the anon-free-via-pagevec
>>patch, very, very, very few pages actually get their final release in
>>__page_cache_release() - zero on uniprocessor, I expect.
>
> If you do this, then I would personally suggest a conceptually different
> approach: make the LRU list count towards the page count. That will
> _automatically_ result in what you describe - if a page is on the LRU
> list, then "freeing" it will always just decrement the count, and the
> _real_ free comes from walking the LRU list and considering count==1 to
> be trivially freeable.
>
> That way you don't have to have separate functions for releasing
> different kinds of pages (we've seen how nasty that was from a
> maintenance standpoint already with the "put_page vs
> page_cache_release" thing).
>
> Ehh?
If every structure locks before removing its reference (i.e. before testing
and/or removing an lru reference we take zone->lru_lock, for slabs we take
cachep->spinlock, etc.), it's a bit of an audit task to make sure the various
locks are taken (and documented) though.
By letting the actual free be lazy as Linus suggests, things should simplify nicely.
comments,
Ed Tomlinson
Thread overview: 37+ messages
2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
2002-08-26 1:52 ` Andrew Morton
2002-08-26 9:10 ` Christian Ehrhardt
2002-08-26 14:22 ` Daniel Phillips
2002-08-26 15:29 ` Christian Ehrhardt
2002-08-26 17:56 ` Daniel Phillips
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
2002-08-27 9:22 ` Christian Ehrhardt
2002-08-27 19:19 ` Andrew Morton
2002-08-26 20:00 ` Christian Ehrhardt
2002-08-26 20:09 ` Daniel Phillips
2002-08-26 20:58 ` Christian Ehrhardt
2002-08-27 16:48 ` Daniel Phillips
2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
2002-08-28 17:42 ` Andrew Morton
2002-08-28 20:41 ` Daniel Phillips
2002-08-28 21:03 ` Andrew Morton
2002-08-28 22:04 ` Daniel Phillips
2002-08-28 22:39 ` Andrew Morton
2002-08-28 22:57 ` Daniel Phillips
2002-08-26 21:31 ` Andrew Morton
2002-08-27 3:42 ` Benjamin LaHaise
2002-08-27 4:37 ` Andrew Morton
2002-08-22 15:59 ` Steven Cole
2002-08-22 16:06 ` Martin J. Bligh
2002-08-22 19:45 ` Steven Cole
2002-08-26 2:15 ` Andrew Morton
2002-08-26 2:08 ` Martin J. Bligh
2002-08-26 2:32 ` Andrew Morton
2002-08-26 3:06 ` Steven Cole
2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
2002-08-27 0:13 ` Rik van Riel