* MM patches against 2.5.31
@ 2002-08-22  2:29 Andrew Morton
  2002-08-22 11:28 ` Christian Ehrhardt
  2002-08-22 15:59 ` Steven Cole
  0 siblings, 2 replies; 37+ messages in thread

From: Andrew Morton @ 2002-08-22  2:29 UTC (permalink / raw)
To: lkml, linux-mm

I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to

	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/

The rolled-up patch there is suitable for ongoing testing and
development.  The individual patches are in the broken-out/
directory and should all be documented.

broken-out/linus.patch
  Incremental BK patch from Linus' tree
broken-out/page_reserved.patch
  Test PageReserved in pagevec_release()
broken-out/scsi_hack.patch
  Fix block-highmem for scsi
broken-out/page_cache_release_lru_fix.patch
  Fix a race between __page_cache_release() and shrink_cache()
broken-out/page_cache_release_fix.patch
  Fix __page_cache_release() bugs
broken-out/mvm.patch
  Fix vmalloc bugs
broken-out/pte-chain-fix.patch
  Fix a VM lockup on uniprocessors
broken-out/func-fix.patch
  gcc-2.91.66 does not support __func__
broken-out/ext3-htree.patch
  Indexed directories for ext3
broken-out/rmap-mapping-BUG.patch
  Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()
broken-out/misc.patch
  misc fixlets
broken-out/tlb-speedup.patch
  Reduce typical global TLB invalidation frequency by 35%
broken-out/buffer-slab-align.patch
  Don't align the buffer_head slab on hardware cacheline boundaries
broken-out/zone-rename.patch
  Rename zone_struct->zone, zonelist_struct->zonelist.  Remove zone_t,
  zonelist_t.
broken-out/per-zone-lru.patch
  Per-zone page LRUs
broken-out/per-zone-lock.patch
  Per-zone LRU list locking
broken-out/l1-max-size.patch
  Infrastructure for determining the maximum L1 cache size which the
  kernel may have to support
broken-out/zone-lock-alignment.patch
  Pad struct zone to ensure that the lru and buddy locks are in
  separate cachelines
broken-out/put_page_cleanup.patch
  Clean up put_page() and page_cache_release()
broken-out/anon-batch-free.patch
  Batched freeing and de-LRUing of anonymous pages
broken-out/writeback-sync.patch
  Writeback fixes and tuneups
broken-out/ext3-inode-allocation.patch
  Fix an ext3 deadlock
broken-out/ext3-o_direct.patch
  O_DIRECT support for ext3
broken-out/jfs-bio.patch
  Convert JFS to use direct-to-BIO I/O
broken-out/discontig-paddr_to_pfn.patch
  Convert page pointers into pfns for i386 NUMA
broken-out/discontig-setup_arch.patch
  Rework setup_arch() for i386 NUMA
broken-out/discontig-mem_init.patch
  Restructure mem_init for i386 NUMA
broken-out/discontig-i386-numa.patch
  discontigmem support for i386 NUMA
broken-out/cleanup-mem_map-1.patch
  Clean up lots of open-coded uses of mem_map[].  For ia32 NUMA
broken-out/zone-pages-reporting.patch
  Fix the boot-time reporting of each zone's available pages
broken-out/enospc-recovery-fix.patch
  Fix the __block_write_full_page() error path
broken-out/fix-faults.patch
  Back out the initial work for atomic copy_*_user()
broken-out/bkl-consolidate.patch
  Consolidate per-arch lock_kernel() implementations
broken-out/might_sleep.patch
  Infrastructure to detect sleep-inside-spinlock bugs
broken-out/spin-lock-check.patch
  spinlock/rwlock checking infrastructure
broken-out/atomic-copy_user.patch
  Support for atomic copy_*_user()
broken-out/kmap_atomic_reads.patch
  Use kmap_atomic() for generic_file_read()
broken-out/kmap_atomic_writes.patch
  Use kmap_atomic() for generic_file_write()
broken-out/config-PAGE_OFFSET.patch
  Configurable kernel/user memory split
broken-out/throttling-fix.patch
  Fix throttling of heavy write()rs
broken-out/dirty-state-accounting.patch
  Make the global dirty memory accounting more accurate
broken-out/rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-22  2:29 MM patches against 2.5.31 Andrew Morton
@ 2002-08-22 11:28 ` Christian Ehrhardt
  2002-08-26  1:52   ` Andrew Morton
  2002-08-22 15:59 ` Steven Cole
  1 sibling, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-22 11:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm

On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
>
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
>	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development.  The individual patches are in the broken-out/
> directory and should all be documented.

Sorry, but we still have the page release race in multiple places.
Look at the following interleaving (the page starts with
page_count == 1; line numbers refer to mm/vmscan.c and mm/swap.c
in 2.5.31-mm1):

Processor 1				Processor 2

refill_inactive: lines 378-395;
as page count == 1 we'll continue
with line 401

					__pagevec_release: line 138
					calls release_pages

					release_pages: lines 100-111;
					put_page_testzero brings the
					page count to 0 and we'll
					continue at line 114.  Note
					that this may happen while
					another processor holds the
					lru lock, i.e. there is no
					point in checking for page
					count == 0 with the lru lock
					held, because the lru lock
					doesn't protect against
					decrements of the page count
					after the check.

line 401: page_cache_get
resurrects the page; page count
is now 1.

lines 402-448; line 448 calls
__pagevec_release

__pagevec_release: line 138 calls
release_pages

release_pages: lines 100-111;
put_page_testzero brings the page
count back to 0 (!!!), i.e. we
continue at line 114:

lines 114-123.  The page count == 0
check in line 123 is successful and
the page is returned to the buddy
allocator.

					lines 114-123.  The page
					count == 0 check in line 123
					is successful, i.e. the page
					is returned to the buddy
					allocator a second time.

					===> BOOM

Neither the lru lock nor any of the page count == 0 checks can prevent
this from happening.

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-22 11:28 ` Christian Ehrhardt
@ 2002-08-26  1:52   ` Andrew Morton
  2002-08-26  9:10     ` Christian Ehrhardt
  0 siblings, 1 reply; 37+ messages in thread

From: Andrew Morton @ 2002-08-26  1:52 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: lkml, linux-mm

Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> >
> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to
> >
> >	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> >
> > The rolled up patch there is suitable for ongoing testing and
> > development.  The individual patches are in the broken-out/
> > directory and should all be documented.
>
> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):
>

So we do.  It's a hugely improbable race, so there's no huge rush
to fix it.  Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range.  ho-hum.

What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all.  Just let it sit on the
LRU until page reclaim encounters it.  With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.

And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.

That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this).  And one which doesn't expect the
pages to become freeable.  The latter will leave the occasional
zero-count page on the LRU, as above.

Sound sane?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
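[For illustration, a minimal sketch of the "freeing" flavour proposed above,
assuming the global _pagemap_lru_lock of 2.5.31-mm1; the name
pagevec_release_freeable is invented here, and a real version would batch
the frees and do them after dropping the irq-disabled lock:]

	/*
	 * Sketch only: drop the refcounts with the LRU lock already
	 * held, so a 1->0 transition can never race against
	 * shrink_cache() finding the page on the LRU.
	 */
	void pagevec_release_freeable(struct pagevec *pvec)
	{
		int i;

		spin_lock_irq(&_pagemap_lru_lock);
		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];

			if (!put_page_testzero(page))
				continue;
			if (TestClearPageLRU(page))
				list_del(&page->lru);
			/* sketch: real code would also fix up the
			 * active/inactive accounting and free outside
			 * the lock */
			__free_pages_ok(page, 0);
		}
		spin_unlock_irq(&_pagemap_lru_lock);
		pagevec_init(pvec);
	}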
* Re: MM patches against 2.5.31
  2002-08-26  1:52 ` Andrew Morton
@ 2002-08-26  9:10   ` Christian Ehrhardt
  2002-08-26 14:22     ` Daniel Phillips
  0 siblings, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-26  9:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm

On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> > >
> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to
> > >
> > >	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> > >
> > > The rolled up patch there is suitable for ongoing testing and
> > > development.  The individual patches are in the broken-out/
> > > directory and should all be documented.
> >
> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):
> >
>
> So we do.  It's a hugely improbable race, so there's no huge rush
> to fix it.

Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.

> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails.  Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range.  ho-hum.

I didn't check each kernel, but I agree that most of the recent
kernels have the potential for this race.  Your tree just happened
to be the one where I found it first.

> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all.  Just let it sit on the
> LRU until page reclaim encounters it.  With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.

It's not just a problem with __page_cache_release, but yes, it seems
to be an SMP-only race.

> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.
>
> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this).  And one which doesn't expect the
> pages to become freeable.  The latter will leave the occasional
> zero-count page on the LRU, as above.
>
> Sound sane?

So the rules would be:

  * if you bring the page count to zero, call __free_pages_ok unless
    the page is on the lru.
  * if someone (page reclaim) walking the lru finds a page with page
    count zero, remove it from the lru and call __free_pages_ok.

This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock.  This doesn't sound like a good idea,
but it seems to be possible to do it race-free.

I'd actually go for the following (patch in "it compiles" state, but
otherwise untested, below to illustrate what I'm talking about).
The basic idea is to move the logic into page_cache_get.  The rules
would be:

1. If you bring the page count to zero (using put_page_testzero) it
   is your job to call __free_pages_ok eventually.  Before doing so,
   make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing
   reference, i.e. page_cache_get could be made to
   BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this
   is only allowed to happen if we found the pointer on an lru list),
   call page_cache_get_lru UNDER the lru lock and just leave the page
   alone if that would resurrect the page.

page_cache_get_lru would basically look like this (implementation
details below):

	int page_cache_get_lru (struct page * page)
	{
		if (!atomic_inc_and_test_for_one (&page->count))
			return 1;
		atomic_dec (&page->count);
		return 0;
	}

Proof of correctness: A page is called dead if its page count reached
zero before (no matter what the page count currently is).  Once a page
is dead there can be at most two pointers to the page: one held by the
lru and the other one held by the thread freeing the page.  Any thread
accessing the page via the lru list will first call page_cache_get_lru
under the lru lock; the thread freeing the page will not read the page
count anymore.  As page_cache_get_lru will not resurrect the page,
there will never be a page count != 0 visible outside the lru lock on
a dead page.  This means that each thread trying to access the dead
page via the lru list will detect that the page is dead and leave it
alone.  It follows that each page is freed at most once.

Suppose a page could be leaked under these rules.  This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a page that is not already dead.  However, the
only place where this happens is in page_cache_get_lru, and it only
happens if the page is already dead.

Now let's look at the ugly part: implementation.  The basic problem
is that we don't have an atomic_inc_and_test_for_one function and it
is unlikely that we'll get one on all architectures.  The solution
(and this is the ugly part) is a change in the semantics of
page->count.  The value -1 now means no reference, 0 means one
reference, etc.  This way we can define put_page_testzero as

	atomic_add_negative (-1, &page->count);

and get_page_testzero as

	atomic_inc_and_test (&page->count);

Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:

diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h	Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
 /*
  * Methods to modify the page usage count.
  *
+ * NOTE: Real page counts start at -1 for no reference.  This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions.  The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
  * What counts for a page usage:
  * - cache mapping (page->mapping)
  * - private data (page->private)
@@ -192,12 +197,25 @@
  *
  * Also, many kernel routines increase the page count before a critical
  * routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special Problem is the lru lists.  Presence on one of these lists
+ * does not increase the page count.  The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok).  This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page, i.e. the real
+ * page->count is now zero, indicating one user.  In this case we drop
+ * the reference again using __put_page.  Both calls must happen under
+ * the lru lock.
  */
 #define get_page(p)		atomic_inc(&(p)->count)
 #define __put_page(p)		atomic_dec(&(p)->count)
-#define put_page_testzero(p)	atomic_dec_and_test(&(p)->count)
-#define page_count(p)		atomic_read(&(p)->count)
-#define set_page_count(p,v)	atomic_set(&(p)->count, v)
+#define put_page_testzero(p)	atomic_add_negative(-1, &(p)->count)
+#define page_count(p)		(1+atomic_read(&(p)->count))
+#define set_page_count(p,v)	atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p)	atomic_inc_and_test(&(p)->count)
 
 extern void FASTCALL(__page_cache_release(struct page *));
 #define put_page(p)						\
	do {							\
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h	Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
 #define PAGE_CACHE_MASK		PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
-#define page_cache_get(x)	get_page(x)
+/*
+ * Get a reference to the page.  This function must not be called on
+ * a dead page, i.e. a page that has page count zero.  If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+	BUG_ON(page_count(page) == 0);
+	get_page(page);
+}
+
+/*
+ * Try to get a reference to a page that we found on an lru list.
+ * The lru lists may contain pages with page count zero.  We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is off the lru lists).  If we'd take a
+ * reference the page would eventually be freed twice.
+ *
+ * The return value is true if we successfully incremented the page
+ * count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+	/*
+	 * Yes there is a window where the page count is not zero
+	 * even though the page is dead.  This is one of the reasons
+	 * why the caller must hold the lru lock.  Due to the lru_lock
+	 * only the thread that is about to free the page can have
+	 * a reference to this page.  This thread will not test the
+	 * page count anymore.
+	 */
+	if (!get_page_testzero (page))
+		return 1;
+	__put_page (page);
+	return 0;
+}
 
 static inline void page_cache_release(struct page *page)
 {
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c	Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
 void __page_cache_release(struct page *page)
 {
 	unsigned long flags;
-	BUG_ON(page_count(page) != 0);
 
 	spin_lock_irqsave(&_pagemap_lru_lock, flags);
 	if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
 		else
 			del_page_from_inactive_list(page);
 	}
-	if (page_count(page) != 0)
-		page = NULL;
 	spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
-	if (page)
-		__free_pages_ok(page, 0);
+	__free_pages_ok(page, 0);
 }
 
 /*
@@ -131,8 +127,7 @@
 			else
 				del_page_from_inactive_list(page);
 		}
-		if (page_count(page) == 0)
-			pagevec_add(&pages_to_free, page);
+		pagevec_add(&pages_to_free, page);
 	}
 	if (lock_held)
 		spin_unlock_irq(&_pagemap_lru_lock);
diff -ur linux-2.5.31-bk7/mm/vmscan.c linux-2.5.31-cae/mm/vmscan.c
--- linux-2.5.31-bk7/mm/vmscan.c	Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/vmscan.c	Sun Aug 25 21:44:07 2002
@@ -92,6 +92,10 @@
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
+/*
+ * The caller must hold a reference to each page in the list.  We drop
+ * this reference if and only if we remove the page from the page_list.
+ */
 static /* inline */ int
 shrink_list(struct list_head *page_list, int nr_pages,
 		zone_t *classzone, unsigned int gfp_mask, int priority, int *max_scan)
@@ -295,24 +299,23 @@
 	spin_lock_irq(&_pagemap_lru_lock);
 	while (max_scan > 0 && nr_pages > 0) {
 		struct page *page;
+		struct list_head * curr;
 		int n = 0;
 
-		while (n < nr_to_process && !list_empty(&inactive_list)) {
-			page = list_entry(inactive_list.prev, struct page, lru);
+		curr = inactive_list.prev;
+		while (n < nr_to_process && (&inactive_list != curr)) {
+			page = list_entry(curr, struct page, lru);
 
-			prefetchw_prev_lru_page(page, &inactive_list, flags);
+			prefetchw_prev_lru_page(page, curr, flags);
+			curr = curr->prev;
 
+			/* Is the page already dead ? */
+			if (!page_cache_get_lru (page))
+				continue;
 			if (!TestClearPageLRU(page))
				BUG();
 			list_del(&page->lru);
-			if (page_count(page) == 0) {
-				/* It is currently in pagevec_release() */
-				SetPageLRU(page);
-				list_add(&page->lru, &inactive_list);
-				continue;
-			}
 			list_add(&page->lru, &page_list);
-			page_cache_get(page);
 			n++;
 		}
 		spin_unlock_irq(&_pagemap_lru_lock);
@@ -381,15 +384,19 @@
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
 	struct page *page;
 	struct pagevec pvec;
+	struct list_head *curr;
 
 	lru_add_drain();
 	spin_lock_irq(&_pagemap_lru_lock);
-	while (nr_pages && !list_empty(&active_list)) {
-		page = list_entry(active_list.prev, struct page, lru);
-		prefetchw_prev_lru_page(page, &active_list, flags);
+	curr = active_list.prev;
+	while (nr_pages && (&active_list != curr)) {
+		page = list_entry(curr, struct page, lru);
+		prefetchw_prev_lru_page(page, curr, flags);
+		curr = curr->prev;
+		if (!page_cache_get_lru (page))
+			continue;
 		if (!TestClearPageLRU(page))
 			BUG();
-		page_cache_get(page);
 		list_move(&page->lru, &l_hold);
 		nr_pages--;
 	}

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
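[To see why the -1 bias buys an atomic "raise from zero and notice it"
operation, here is a small self-contained model of the counter semantics:
plain single-threaded C, with helper names invented for the illustration;
only the arithmetic matches the patch above:]

	#include <assert.h>

	/* stored value = references - 1, so a dead page stores -1 */
	struct biased { int stored; };

	static void set_count(struct biased *c, int refs) { c->stored = refs - 1; }
	static int page_count(struct biased *c) { return 1 + c->stored; }

	/* models atomic_inc_and_test(): true iff the increment made it
	 * zero, i.e. iff we just took a dead page from 0 to 1 references */
	static int get_page_testzero(struct biased *c) { return ++c->stored == 0; }

	/* models atomic_add_negative(-1, ...): true iff the result went
	 * negative, i.e. iff we dropped the last reference */
	static int put_page_testzero(struct biased *c) { return --c->stored < 0; }

	int main(void)
	{
		struct biased c;

		set_count(&c, 1);		/* one reference */
		assert(!get_page_testzero(&c));	/* ordinary extra ref: 2 */
		assert(!put_page_testzero(&c));	/* back down to 1 */
		assert(put_page_testzero(&c));	/* last ref gone: dead */
		assert(get_page_testzero(&c));	/* resurrection detected */
		assert(page_count(&c) == 1);
		return 0;
	}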
* Re: MM patches against 2.5.31
  2002-08-26  9:10 ` Christian Ehrhardt
@ 2002-08-26 14:22   ` Daniel Phillips
  2002-08-26 15:29     ` Christian Ehrhardt
  0 siblings, 1 reply; 37+ messages in thread

From: Daniel Phillips @ 2002-08-26 14:22 UTC (permalink / raw)
To: Christian Ehrhardt, Andrew Morton; +Cc: lkml, linux-mm

On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> + * A special Problem is the lru lists.  Presence on one of these lists
> + * does not increase the page count.

Please remind me... why should it not?

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 14:22 ` Daniel Phillips
@ 2002-08-26 15:29   ` Christian Ehrhardt
  2002-08-26 17:56     ` Daniel Phillips
  0 siblings, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-26 15:29 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm

On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > + * A special Problem is the lru lists.  Presence on one of these lists
> > + * does not increase the page count.
>
> Please remind me... why should it not?

Pages that are only on the lru but not referenced by anyone are of no
use and we want to free them immediately.  If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru.  One could argue that
try_to_free_pages could do this, but try_to_free_pages will process
the pages in lru order and push out other pages first.

The next suggestion that comes to mind is: let's have some magic in
page_cache_release that will remove the page from the lru list if it
is actually dead.  However, this raises the question: how do we detect
that a page is now dead?  The answer is something along the lines of

	if (put_page_testzero (page)) {
		__free_pages_ok (page);
		return;
	}
	spin_lock_irq(pagemap_lru_lock);
	if (PageLRU(page) && (page_count(page) == 1)) {
		lru_cache_del (page);
		spin_unlock_irq(pagemap_lru_lock);
		page_cache_release (page);
		return;
	}
	spin_unlock_irq(pagemap_lru_lock);
	return;

The sad truth is that this solution has all the same races that we
have now, plus it makes the fast path (decreasing the page count to
something not zero) slower.  One problem in the above would be that
the last reference might well not be due to the lru cache, i.e. at
the time we call PageLRU(page) the page might have been freed by
another processor.

I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.

   regards   Christian Ehrhardt

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 15:29 ` Christian Ehrhardt
@ 2002-08-26 17:56   ` Daniel Phillips
  2002-08-26 19:24     ` Andrew Morton
  2002-08-26 20:00     ` Christian Ehrhardt
  0 siblings, 2 replies; 37+ messages in thread

From: Daniel Phillips @ 2002-08-26 17:56 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm

On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > + * does not increase the page count.
> >
> > Please remind me... why should it not?
>
> Pages that are only on the lru but not referenced by anyone are of no
> use and we want to free them immediately.  If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.

I don't understand this argument.  Suppose lru list membership is worth a
page count of one.  Then anyone who finds a page by way of the lru list can
safely put_page_testzero and remove the page from the lru list.  Anyone who
finds a page by way of a page table can likewise put_page_testzero and clear
the pte, or remove the mapping and pass the page to Andrew's pagevec
machinery, which will eventually do the put_page_testzero.  Anyone who
removes a page from a radix tree will also do a put_page_testzero.  Exactly
one of those paths will result in the page count reaching zero, which tells
us nobody else holds a reference and it's time for __free_pages_ok.  The page
is thus freed immediately as soon as there are no more references to it, and
does not hang around on the lru list.

Nobody has to lock against the page count.  Each put_page_testzero caller
only locks the data structure from which it's removing the reference.

This seems so simple, what is the flaw?

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
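[A sketch of the convention being argued for here, with 2.4-style names
(pagemap_lru_lock, inactive_list); the _ref suffixes are invented for the
illustration and this is not from any posted patch:]

	/* LRU membership itself owns one reference. */
	static inline void lru_cache_add_ref(struct page *page)
	{
		page_cache_get(page);		/* the LRU's own reference */
		spin_lock(&pagemap_lru_lock);
		if (!TestSetPageLRU(page))
			list_add(&page->lru, &inactive_list);
		spin_unlock(&pagemap_lru_lock);
	}

	static inline void lru_cache_del_ref(struct page *page)
	{
		spin_lock(&pagemap_lru_lock);
		if (TestClearPageLRU(page))
			list_del(&page->lru);
		spin_unlock(&pagemap_lru_lock);
		page_cache_release(page);	/* may well be the final put */
	}

Under this convention the put in lru_cache_del_ref is an ordinary
page_cache_release: whichever path hits zero is, by construction, the
last holder of the page.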
* Re: MM patches against 2.5.31
  2002-08-26 17:56 ` Daniel Phillips
@ 2002-08-26 19:24   ` Andrew Morton
  2002-08-26 19:34     ` Daniel Phillips
    ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread

From: Andrew Morton @ 2002-08-26 19:24 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm

Daniel Phillips wrote:
>
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately.  If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument.  Suppose lru list membership is worth a
> page count of one.  Then anyone who finds a page by way of the lru list can
> safely put_page_testzero and remove the page from the lru list.  Anyone who
> finds a page by way of a page table can likewise put_page_testzero and clear
> the pte, or remove the mapping and pass the page to Andrew's pagevec
> machinery, which will eventually do the put_page_testzero.  Anyone who
> removes a page from a radix tree will also do a put_page_testzero.  Exactly
> one of those paths will result in the page count reaching zero, which tells
> us nobody else holds a reference and it's time for __free_pages_ok.  The page
> is thus freed immediately as soon as there are no more references to it, and
> does not hang around on the lru list.
>
> Nobody has to lock against the page count.  Each put_page_testzero caller
> only locks the data structure from which it's removing the reference.
>
> This seems so simple, what is the flaw?

The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the
zero-refcount page.

CPUA:
	if (put_page_testzero()) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(page->lru);

CPUB:
	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);	/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */

2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.

It's that `put' outside the lock which is the culprit.  Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).

The sane, sensible and sucky way is to always take the lock:

	page_cache_release(page)
	{
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			lru_cache_del(page);
			__free_pages_ok(page, 0);
		}
		spin_unlock(lru_lock);
	}

Because this provides exclusion from another CPU discovering the page
via the LRU.

So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()?  Maybe:

	page_cache_release(page)
	{
		if (page_count(page) == 1) {
			spin_lock(lru_lock);
			if (put_page_testzero(page)) {
				if (PageLRU(page))
					__lru_cache_del(page);
				__free_pages_ok(page, 0);
			}
			spin_unlock(lru_lock);
		} else {
			atomic_dec(&page->count);
		}
	}

This is nice and quick, but racy.  Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU.  These are
rare, and can be mopped up in page reclaim.

The above code will also work for pages which aren't on the LRU.  It will
take the lock unnecessarily for (say) slab pages.  But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.

[1] The race requires that the CPU running page_cache_release find a
    five-instruction window against the CPU running shrink_cache.  And
    that they be operating against the same page.  And that the CPU
    running __page_cache_release() then take an interrupt in a 3-4
    instruction window.  And that the interrupt take longer than the
    runtime for shrink_list.  And that the page be the first page in
    the pagevec.

It's a heat-death-of-the-universe race, but even if it were to be
ignored, the current code is too complex.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
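[For comparison, the dec-and-lock idiom mentioned above, sketched with the
global lock name from 2.5.31-mm1.  Whether an irq-safe
atomic_dec_and_lock() variant is available in that tree is an open
question, so this is illustrative only:]

	/*
	 * Sketch: atomic_dec_and_lock() guarantees that the 1->0
	 * transition happens with the lock held on its slow path, so a
	 * scanner holding the same lock can never observe (and
	 * resurrect) a page whose count is in flight to zero.
	 */
	void page_cache_release(struct page *page)
	{
		if (atomic_dec_and_lock(&page->count, &_pagemap_lru_lock)) {
			/* count is zero and we hold the lock */
			if (TestClearPageLRU(page))
				list_del(&page->lru);
			spin_unlock(&_pagemap_lru_lock);
			__free_pages_ok(page, 0);
		}
	}

Because the count can only reach zero under the lock, and the holder
removes the page from the LRU before unlocking, a scanner never sees a
zero-count page on the list at all.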
* Re: MM patches against 2.5.31
  2002-08-26 19:24 ` Andrew Morton
@ 2002-08-26 19:34   ` Daniel Phillips
  2002-08-26 19:48   ` Christian Ehrhardt
  2002-08-27  9:22   ` Christian Ehrhardt
  2 siblings, 0 replies; 37+ messages in thread

From: Daniel Phillips @ 2002-08-26 19:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm

On Monday 26 August 2002 21:24, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately.  If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument.  Suppose lru list membership is worth a
> > page count of one.  Then anyone who finds a page by way of the lru list can
> > safely put_page_testzero and remove the page from the lru list.  Anyone who
> > finds a page by way of a page table can likewise put_page_testzero and clear
> > the pte, or remove the mapping and pass the page to Andrew's pagevec
> > machinery, which will eventually do the put_page_testzero.  Anyone who
> > removes a page from a radix tree will also do a put_page_testzero.  Exactly
> > one of those paths will result in the page count reaching zero, which tells
> > us nobody else holds a reference and it's time for __free_pages_ok.  The page
> > is thus freed immediately as soon as there are no more references to it, and
> > does not hang around on the lru list.
> >
> > Nobody has to lock against the page count.  Each put_page_testzero caller
> > only locks the data structure from which it's removing the reference.
> >
> > This seems so simple, what is the flaw?
>
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the
> zero-refcount page.
>
> CPUA:
>	if (put_page_testzero()) {
>		/* Here's the window */
>		spin_lock(lru_lock);
>		list_del(page->lru);

According to my assumption that lru list membership is (should be) worth
one page count, if testzero triggers here the page is not on the lru.

> CPUB:
>
>	spin_lock(lru_lock);
>	page = list_entry(lru);
>	page_cache_get(page);	/* If this goes from 0->1, we die */

It can't.  You know that because you found the page on the lru, its count
must be at least one (again, according to assumption above).

>	...
>	page_cache_release(page);	/* double free */

I'd like to jump in and chase more solutions with you, but the above
doesn't prove your point, so I'm not ready to reject this one yet.

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 19:24 ` Andrew Morton
  2002-08-26 19:34   ` Daniel Phillips
@ 2002-08-26 19:48   ` Christian Ehrhardt
  2002-08-27  9:22   ` Christian Ehrhardt
  2 siblings, 0 replies; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-26 19:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm

On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking

Well, one could argue that doing the put_page_testzero outside of any
locking is a feature.

> [ ... ]
>
> 2.5.31-mm1 has tests which make this race enormously improbable [1],
> but it's still there.

Agreed.  Both on the "improbable" and on the "still there" part.

> It's that `put' outside the lock which is the culprit.  Normally, we
> handle that with atomic_dec_and_lock() (inodes) or by manipulating
> the refcount inside an area which has exclusion (page presence in
> pagecache).
>
> The sane, sensible and sucky way is to always take the lock:
>
>	page_cache_release(page)
>	{
>		spin_lock(lru_lock);
>		if (put_page_testzero(page)) {
>			lru_cache_del(page);
>			__free_pages_ok(page, 0);
>		}
>		spin_unlock(lru_lock);
>	}

That would probably solve the problem.

> Because this provides exclusion from another CPU discovering the page
> via the LRU.
>
> So taking the above as the design principle, how can we speed it up?
> How to avoid taking the lock in every page_cache_release()?  Maybe:
>
>	page_cache_release(page)
>	{
>		if (page_count(page) == 1) {
>			spin_lock(lru_lock);
>			if (put_page_testzero(page)) {
>				if (PageLRU(page))
>					__lru_cache_del(page);
>				__free_pages_ok(page, 0);
>			}
>			spin_unlock(lru_lock);
>		} else {
>			atomic_dec(&page->count);
>		}
>	}

However, this is an incredibly bad idea if the page is NOT on the lru.
Think of two instances of page_cache_release racing against each other.
This could result in a leaked page which is not on the LRU.

> This is nice and quick, but racy.  Two concurrent page_cache_releases
> will create a zero-ref unfreed page which is on the LRU.  These are
> rare, and can be mopped up in page reclaim.
>
> The above code will also work for pages which aren't on the LRU.  It will
> take the lock unnecessarily for (say) slab pages.  But if we put slab pages
> on the LRU then I suspect there are so few non-LRU pages left that it isn't
> worth bothering about this.

No it will not work.  See above.

> [1] The race requires that the CPU running page_cache_release find a
>     five-instruction window against the CPU running shrink_cache.  And
>     that they be operating against the same page.  And that the CPU
>     running __page_cache_release() then take an interrupt in a 3-4
>     instruction window.  And that the interrupt take longer than the
>     runtime for shrink_list.  And that the page be the first page in
>     the pagevec.

The interrupt can also be a preemption, which might easily take long
enough.  But I agree that the race is now rare.  The real problem is
that the locking rules don't guarantee that there are no other racy
paths that we both missed.

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 19:24 ` Andrew Morton
  2002-08-26 19:34   ` Daniel Phillips
  2002-08-26 19:48   ` Christian Ehrhardt
@ 2002-08-27  9:22   ` Christian Ehrhardt
  2002-08-27 19:19     ` Andrew Morton
  2 siblings, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-27  9:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm

On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the
> zero-refcount page.
>
> CPUA:
>	if (put_page_testzero()) {
>		/* Here's the window */
>		spin_lock(lru_lock);
>		list_del(page->lru);
>
> CPUB:
>
>	spin_lock(lru_lock);
>	page = list_entry(lru);
>	page_cache_get(page);	/* If this goes from 0->1, we die */
>	...
>	page_cache_release(page);	/* double free */

So what we want CPUB to do instead is

	spin_lock(lru_lock);
	page = list_entry(lru);

	START ATOMIC
	page_cache_get(page);
	res = (page_count (page) == 1);
	END ATOMIC

	if (res) {
		atomic_dec (&page->count);
		continue;	/* with next page */
	}
	...
	page_cache_release (page);

I.e. we want to detect _atomically_ that we just raised the page count
from zero to one.  My patch actually has a solution that implements the
needed atomic operation above by means of the atomic functions that we
currently have on all archs (it's called get_page_testzero and should
probably be called get_page_testone).

The more I think about this the more I think this is the way to go.

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
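[The primitive wanted here can also be expressed without the count bias on
any architecture with cmpxchg(); a hypothetical sketch, with a name
invented for the illustration (it is in no posted patch):]

	/*
	 * Take a reference unless the page is already dead
	 * (count == 0).  Returns 1 if a reference was taken, 0 if the
	 * page was dead.  Unlike a plain increment-then-check, this
	 * never resurrects a dead page even transiently.
	 */
	static inline int page_cache_get_unless_zero(struct page *page)
	{
		int old;

		do {
			old = atomic_read(&page->count);
			if (old == 0)
				return 0;	/* dead: leave it for the freer */
		} while (cmpxchg(&page->count.counter, old, old + 1) != old);
		return 1;
	}

Logically this is the same operation as the biased get_page_testzero,
just spelled as a compare-and-swap loop instead of an unconditional
increment plus undo.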
* Re: MM patches against 2.5.31
  2002-08-27  9:22 ` Christian Ehrhardt
@ 2002-08-27 19:19   ` Andrew Morton
  0 siblings, 0 replies; 37+ messages in thread

From: Andrew Morton @ 2002-08-27 19:19 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Daniel Phillips, lkml, linux-mm

Christian Ehrhardt wrote:
>
> ...
> So what we want CPUB to do instead is
>
>	spin_lock(lru_lock);
>	page = list_entry(lru);
>
>	START ATOMIC
>	page_cache_get(page);
>	res = (page_count (page) == 1);
>	END ATOMIC
>
>	if (res) {
>		atomic_dec (&page->count);
>		continue;	/* with next page */
>	}
>	...
>	page_cache_release (page);
>
> I.e. we want to detect _atomically_ that we just raised the page count
> from zero to one.  My patch actually has a solution that implements the
> needed atomic operation above by means of the atomic functions that we
> currently have on all archs (it's called get_page_testzero and should
> probably be called get_page_testone).
> The more I think about this the more I think this is the way to go.
>

Yes, I think that would provide a minimal fix to the problem.

(I'd prefer a solution in which presence on the LRU contributes to
page->count, because that means I can dump a load of expensive
page_cache_get-inside-lru-lock instances, but whatever)

You had:

-#define put_page_testzero(p)	atomic_dec_and_test(&(p)->count)
-#define page_count(p)		atomic_read(&(p)->count)
-#define set_page_count(p,v)	atomic_set(&(p)->count, v)
+#define put_page_testzero(p)	atomic_add_negative(-1, &(p)->count)
+#define page_count(p)		(1+atomic_read(&(p)->count))
+#define set_page_count(p,v)	atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p)	atomic_inc_and_test(&(p)->count)

So the page count is actually offset by -1, and that is hidden by the
macros.  Fair enough.

atomic_add_negative() is not implemented on quite a number of
architectures (sparc64, mips, ppc, sh, cris, 68k, alpha..), so some
legwork is needed there.  Looks to be pretty simple though; alpha, ppc
and others already have atomic_add_return().
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
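[For the architectures named above, the missing primitive is a one-liner
given atomic_add_return(); a sketch, assuming the i386 semantics of "add
and report whether the result went negative":]

	/* true iff (v + i) is negative after the addition */
	static __inline__ int atomic_add_negative(int i, atomic_t *v)
	{
		return atomic_add_return(i, v) < 0;
	}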
* Re: MM patches against 2.5.31
  2002-08-26 17:56 ` Daniel Phillips
  2002-08-26 19:24   ` Andrew Morton
@ 2002-08-26 20:00   ` Christian Ehrhardt
  2002-08-26 20:09     ` Daniel Phillips
  1 sibling, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-26 20:00 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm

On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately.  If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument.  Suppose lru list membership is worth a
> page count of one.  Then anyone who finds a page by way of the lru list can

This does fix the double free problem, but think of a typical anonymous
page at exit.  The page is on the lru list and there is one reference
held by the pte.  According to your scheme the pte reference would be
dropped (obviously due to the exit) but the page would remain on the lru
list.  However, there is no point in leaving the page on the lru list at
all.  If you think about who is going to remove the page from the lru
you'll see the problem.

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 20:00 ` Christian Ehrhardt
@ 2002-08-26 20:09   ` Daniel Phillips
  2002-08-26 20:58     ` Christian Ehrhardt
  2002-08-26 21:31     ` Andrew Morton
  0 siblings, 2 replies; 37+ messages in thread

From: Daniel Phillips @ 2002-08-26 20:09 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm

On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately.  If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument.  Suppose lru list membership is worth a
> > page count of one.  Then anyone who finds a page by way of the lru list can
>
> This does fix the double free problem, but think of a typical anonymous
> page at exit.  The page is on the lru list and there is one reference
> held by the pte.  According to your scheme the pte reference would be
> dropped (obviously due to the exit) but the page would remain on the lru
> list.  However, there is no point in leaving the page on the lru list at
> all.

If you want the page off the lru list at that point (which you probably do)
then you take the lru lock and put_page_testzero.

> If you think about who is going to remove the page from the lru you'll
> see the problem.

Nope, still don't see it.  Whoever hits put_page_testzero frees the page,
secure in the knowledge that there are no other references to it.

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 20:09 ` Daniel Phillips
@ 2002-08-26 20:58   ` Christian Ehrhardt
  2002-08-27 16:48     ` Daniel Phillips
  1 sibling, 1 reply; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-26 20:58 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm

On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > + * A special Problem is the lru lists.  Presence on one of these lists
> > > > > > + * does not increase the page count.
> > > > >
> > > > > Please remind me... why should it not?
> > > >
> > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > use and we want to free them immediately.  If we leave them on the lru
> > > > list with a page count of 1, someone else will have to walk the lru
> > > > list and remove pages that are only on the lru.
> > >
> > > I don't understand this argument.  Suppose lru list membership is worth a
> > > page count of one.  Then anyone who finds a page by way of the lru list can
> >
> > This does fix the double free problem, but think of a typical anonymous
> > page at exit.  The page is on the lru list and there is one reference
> > held by the pte.  According to your scheme the pte reference would be
> > dropped (obviously due to the exit) but the page would remain on the lru
> > list.  However, there is no point in leaving the page on the lru list at
> > all.
>
> If you want the page off the lru list at that point (which you probably do)
> then you take the lru lock and put_page_testzero.

Could you clarify what you mean with "at that point"?  Especially, how
do you plan to test for "that point"?  Besides, it is illegal to use
the page after put_page_testzero (unless put_page_testzero returns true).

> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it.  Whoever hits put_page_testzero frees the page,
> secure in the knowledge that there are no other references to it.

Well yes, but we cannot remove the page from the lru automatically
at page_cache_release time if we follow your proposal.  If you think we
can, show me your implementation of page_cache_release and I'll show
you where the races are (unless you do everything under the lru_lock
of course).

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-26 20:58 ` Christian Ehrhardt
@ 2002-08-27 16:48   ` Daniel Phillips
  2002-08-28 13:14     ` Christian Ehrhardt
  0 siblings, 1 reply; 37+ messages in thread

From: Daniel Phillips @ 2002-08-27 16:48 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm

On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> > > Nope, still don't see it.  Whoever hits put_page_testzero frees the page,
> > > secure in the knowledge that there are no other references to it.
> >
> > Well yes, but we cannot remove the page from the lru automatically
> > at page_cache_release time if we follow your proposal.  If you think we
> > can, show me your implementation of page_cache_release and I'll show
> > you where the races are (unless you do everything under the lru_lock
> > of course).

	void page_cache_release(struct page *page)
	{
		spin_lock(&pagemap_lru_lock);
		if (PageLRU(page) && page_count(page) == 2) {
			__lru_cache_del(page);
			atomic_dec(&page->count);
		}
		spin_unlock(&pagemap_lru_lock);
		if (put_page_testzero(page))
			__free_pages_ok(page, 0);
	}

This allows the following benign race, with initial page count = 3:

	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page) && page_count(page) == 2)	/* false */
	spin_unlock(&pagemap_lru_lock);

			spin_lock(&pagemap_lru_lock);
			if (PageLRU(page) && page_count(page) == 2)	/* false */
			spin_unlock(&pagemap_lru_lock);

	if (put_page_testzero(page))
		__free_pages_ok(page, 0);

			if (put_page_testzero(page))
				__free_pages_ok(page, 0);

Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1.  This won't happen often and such
pages will be recovered from the cold end of the list in due course.

The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page?  In
other words, can the test

	PageLRU(page) && page_count(page) == 2

return a false positive?  Well, when this test is true we can account
for both references: the one we own, and the one the lru list owns.
Since we hold the lru lock, the latter won't change.  Nobody else has
the right to increment the page count, since they must inherit that
right from somebody who holds a reference, and there are none.

We could also do this:

	void page_cache_release(struct page *page)
	{
		if (page_count(page) == 2) {
			spin_lock(&pagemap_lru_lock);
			if (PageLRU(page) && page_count(page) == 2) {
				__lru_cache_del(page);
				atomic_dec(&page->count);
			}
			spin_unlock(&pagemap_lru_lock);
		}
		if (put_page_testzero(page))
			__free_pages_ok(page, 0);
	}

Which avoids taking the lru lock sometimes, in exchange for widening the
hole through which pages can end up with count = 1 on the lru list.

Let's run this through your race detector and see what happens.

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
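[The mop-up relied on above could look something like this inside
shrink_cache(), under pagemap_lru_lock and the lru-owns-a-reference
convention; a sketch, not from any posted patch:]

	/*
	 * Benign-race leftovers: count == 1 means the LRU's own
	 * reference is the only one left.  Nobody can take a new
	 * reference, since that right must be inherited from an
	 * existing holder, and we hold the lock guarding the LRU's.
	 */
	if (page_count(page) == 1) {
		if (!TestClearPageLRU(page))
			BUG();
		list_del(&page->lru);
		if (put_page_testzero(page))	/* drops the LRU's ref */
			__free_pages_ok(page, 0);
		continue;
	}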
* Re: MM patches against 2.5.31
  2002-08-27 16:48 ` Daniel Phillips
@ 2002-08-28 13:14   ` Christian Ehrhardt
  2002-08-28 17:18     ` Daniel Phillips
  2002-08-28 20:41     ` Daniel Phillips
  0 siblings, 2 replies; 37+ messages in thread

From: Christian Ehrhardt @ 2002-08-28 13:14 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm

On Tue, Aug 27, 2002 at 06:48:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> > > Nope, still don't see it.  Whoever hits put_page_testzero frees the page,
> > > secure in the knowledge that there are no other references to it.
> >
> > Well yes, but we cannot remove the page from the lru automatically
> > at page_cache_release time if we follow your proposal.  If you think we
> > can, show me your implementation of page_cache_release and I'll show
> > you where the races are (unless you do everything under the lru_lock
> > of course).
>
>	void page_cache_release(struct page *page)
>	{
>		spin_lock(&pagemap_lru_lock);
>		if (PageLRU(page) && page_count(page) == 2) {
>			__lru_cache_del(page);
>			atomic_dec(&page->count);
>		}
>		spin_unlock(&pagemap_lru_lock);
>		if (put_page_testzero(page))
>			__free_pages_ok(page, 0);
>	}
>
> This allows the following benign race, with initial page count = 3:
> [ ... ]
> Neither holder of a page reference sees the count at 2, and so the page
> is left on the lru with count = 1.  This won't happen often and such
> pages will be recovered from the cold end of the list in due course.

Ok, agreed.  I think this will work, but taking the lru lock each time
is probably not a good idea.

> We could also do this:
>
>	void page_cache_release(struct page *page)
>	{
>		if (page_count(page) == 2) {
>			spin_lock(&pagemap_lru_lock);
>			if (PageLRU(page) && page_count(page) == 2) {
>				__lru_cache_del(page);
>				atomic_dec(&page->count);
>			}
>			spin_unlock(&pagemap_lru_lock);
>		}
>		if (put_page_testzero(page))
>			__free_pages_ok(page, 0);
>	}
>
> Which avoids taking the lru lock sometimes, in exchange for widening the
> hole through which pages can end up with count = 1 on the lru list.

This sounds like something that is worth trying.  I missed that one.

Side note: the BUG in __pagevec_lru_del seems strange.  refill_inactive
or shrink_cache could have removed the page from the lru before
__pagevec_lru_del acquired the lru lock.

   regards   Christian

--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-28 13:14 ` Christian Ehrhardt
@ 2002-08-28 17:18   ` Daniel Phillips
  2002-08-28 17:42     ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread

From: Daniel Phillips @ 2002-08-28 17:18 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm

On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> Side note: the BUG in __pagevec_lru_del seems strange.  refill_inactive
> or shrink_cache could have removed the page from the lru before
> __pagevec_lru_del acquired the lru lock.

It's suspect all right.  If there's a chain of assumptions that proves
the page is always on the lru at that point, I haven't seen it yet.

--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
  2002-08-28 17:18 ` Daniel Phillips
@ 2002-08-28 17:42   ` Andrew Morton
  0 siblings, 0 replies; 37+ messages in thread

From: Andrew Morton @ 2002-08-28 17:42 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm

Daniel Phillips wrote:
>
> On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> > Side note: the BUG in __pagevec_lru_del seems strange.  refill_inactive
> > or shrink_cache could have removed the page from the lru before
> > __pagevec_lru_del acquired the lru lock.
>
> It's suspect all right.  If there's a chain of assumptions that proves
> the page is always on the lru at that point, I haven't seen it yet.

Yeah.  __pagevec_lru_del is only used by invalidate_inode_pages.
A very simple solution is to just delete it.

untested code:

 include/linux/pagevec.h |    7 -------
 mm/filemap.c            |   10 +++++-----
 mm/swap.c               |   28 ----------------------------
 3 files changed, 5 insertions(+), 40 deletions(-)

--- 2.5.32/mm/filemap.c~pagevec_lru_del	Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/filemap.c	Wed Aug 28 09:51:51 2002
@@ -116,10 +116,10 @@ void invalidate_inode_pages(struct inode
 	struct list_head *head, *curr;
 	struct page * page;
 	struct address_space *mapping = inode->i_mapping;
-	struct pagevec lru_pvec;
+	struct pagevec pvec;
 
 	head = &mapping->clean_pages;
-	pagevec_init(&lru_pvec);
+	pagevec_init(&pvec);
 
 	write_lock(&mapping->page_lock);
 	curr = head->next;
@@ -143,8 +143,8 @@ void invalidate_inode_pages(struct inode
 
 		__remove_from_page_cache(page);
 		unlock_page(page);
-		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_del(&lru_pvec);
+		if (!pagevec_add(&pvec, page))
+			__pagevec_release(&pvec);
 		continue;
 unlock:
 		unlock_page(page);
@@ -152,7 +152,7 @@ unlock:
 	}
 
 	write_unlock(&mapping->page_lock);
-	pagevec_lru_del(&lru_pvec);
+	pagevec_release(&pvec);
 }
 
 static int do_invalidatepage(struct page *page, unsigned long offset)
--- 2.5.32/include/linux/pagevec.h~pagevec_lru_del	Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/include/linux/pagevec.h	Wed Aug 28 09:51:51 2002
@@ -18,7 +18,6 @@ void __pagevec_release(struct pagevec *p
 void __pagevec_release_nonlru(struct pagevec *pvec);
 void __pagevec_free(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_del(struct pagevec *pvec);
 void lru_add_drain(void);
 void pagevec_deactivate_inactive(struct pagevec *pvec);
 
@@ -69,9 +68,3 @@ static inline void pagevec_lru_add(struc
 	if (pagevec_count(pvec))
 		__pagevec_lru_add(pvec);
 }
-
-static inline void pagevec_lru_del(struct pagevec *pvec)
-{
-	if (pagevec_count(pvec))
-		__pagevec_lru_del(pvec);
-}
--- 2.5.32/mm/swap.c~pagevec_lru_del	Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/swap.c	Wed Aug 28 09:51:58 2002
@@ -214,34 +214,6 @@ void __pagevec_lru_add(struct pagevec *p
 }
 
 /*
- * Remove the passed pages from the LRU, then drop the caller's refcount on
- * them.  Reinitialises the caller's pagevec.
- */
-void __pagevec_lru_del(struct pagevec *pvec)
-{
-	int i;
-	struct zone *zone = NULL;
-
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		if (!TestClearPageLRU(page))
-			BUG();
-		del_page_from_lru(zone, page);
-	}
-	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(pvec);
-}
-
-/*
  * Perform any setup for the swap system
  */
 void __init swap_setup(void)
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 13:14 ` Christian Ehrhardt 2002-08-28 17:18 ` Daniel Phillips @ 2002-08-28 20:41 ` Daniel Phillips 2002-08-28 21:03 ` Andrew Morton 1 sibling, 1 reply; 37+ messages in thread From: Daniel Phillips @ 2002-08-28 20:41 UTC (permalink / raw) To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm Going right back to basics, what do you suppose is wrong with the 2.4 strategy of always doing the lru removal in free_pages_ok? -- Daniel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 20:41 ` Daniel Phillips @ 2002-08-28 21:03 ` Andrew Morton 2002-08-28 22:04 ` Daniel Phillips 0 siblings, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-28 21:03 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm Daniel Phillips wrote: > > Going right back to basics, what do you suppose is wrong with the 2.4 > strategy of always doing the lru removal in free_pages_ok? That's equivalent to what we have at present, which is:

	if (put_page_testzero(page)) {
		/* window here */
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}

versus:

	spin_lock(lru lock);
	page = list_entry(lru, ...);
	if (page_count(page) == 0)
		continue;
	/* window here */
	page_cache_get(page);
	page_cache_release(page);	/* double-free */

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
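To make that second window concrete, here is a small self-contained userspace model of the interleaving. This is a sketch only: struct page, put_page_testzero() and the free are simplified stand-ins for the kernel's versions, and the racy schedule is replayed sequentially rather than with real threads.

	#include <stdio.h>
	#include <stdbool.h>

	/* Stand-ins for the kernel's page refcounting. */
	struct page {
		int count;
		int times_freed;
	};

	static bool put_page_testzero(struct page *page)
	{
		return --page->count == 0;	/* atomic_dec_and_test() in the kernel */
	}

	static void free_page_model(struct page *page)
	{
		page->times_freed++;		/* the kernel would return it to the buddy allocator */
	}

	int main(void)
	{
		struct page page = { .count = 1, .times_freed = 0 };

		/* LRU scanner: evaluates "if (page_count(page) == 0) continue;"
		 * while the count is still 1, so it decides the page is in use. */

		/* Other CPU, in the window: drops the final reference. */
		if (put_page_testzero(&page))
			free_page_model(&page);		/* first free */

		/* Scanner resumes, acting on its now-stale decision: */
		page.count++;				/* page_cache_get() */
		if (put_page_testzero(&page))		/* page_cache_release() */
			free_page_model(&page);		/* second free of the same page */

		printf("page freed %d times\n", page.times_freed);	/* prints 2 */
		return 0;
	}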
* Re: MM patches against 2.5.31 2002-08-28 21:03 ` Andrew Morton @ 2002-08-28 22:04 ` Daniel Phillips 2002-08-28 22:39 ` Andrew Morton 0 siblings, 1 reply; 37+ messages in thread From: Daniel Phillips @ 2002-08-28 22:04 UTC (permalink / raw) To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm On Wednesday 28 August 2002 23:03, Andrew Morton wrote: > Daniel Phillips wrote: > > > > Going right back to basics, what do you suppose is wrong with the 2.4 > > strategy of always doing the lru removal in free_pages_ok? > > That's equivalent to what we have at present, which is: > > if (put_page_testzero(page)) { > /* window here */ > lru_cache_del(page); > __free_pages_ok(page, 0); > } > > versus: > > spin_lock(lru lock); > page = list_entry(lru, ...); > if (page_count(page) == 0) > continue; > /* window here */ > page_cache_get(page); > page_cache_release(page); /* double-free */ Indeed it is. In 2.4.19 we have:

(vmscan.c: shrink_cache)                  (page_alloc.c: __free_pages)
365  if (unlikely(!page_count(page)))
366          continue;
                                          444  if (!PageReserved(page) && put_page_testzero(page))
     [many twisty paths, all different]
511  /* effectively free the page here */
512  page_cache_release(page);
                                          445          __free_pages_ok(page, order);
     [free it again just to make sure]

So there's no question that the race is lurking in 2.4. I noticed several more paths besides the one above that look suspicious as well. The bottom line is, 2.4 needs a fix along the lines of my suggestion or Christian's, something that can actually be proved. It's a wonder that this problem manifests so rarely in practice. -- Daniel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 22:04 ` Daniel Phillips @ 2002-08-28 22:39 ` Andrew Morton 2002-08-28 22:57 ` Daniel Phillips 0 siblings, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-28 22:39 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm Daniel Phillips wrote: > > ... > So there's no question that the race is lurking in 2.4. I noticed several > more paths besides the one above that look suspicious as well. The bottom > line is, 2.4 needs a fix along the lines of my suggestion or Christian's, > something that can actually be proved. > > It's a wonder that this problem manifests so rarely in practice. I sort-of glanced through the 2.4 paths and it appears that in all of the places where it could do a page_cache_get/release, that would never happen because of other parts of the page state. Like: it can't be in pagecache, so we won't run writepage, and it can't have buffers, so we won't run try_to_release_page(). Of course, I might have missed a path. And, well, generally: ugh. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-28 22:39 ` Andrew Morton @ 2002-08-28 22:57 ` Daniel Phillips 0 siblings, 0 replies; 37+ messages in thread From: Daniel Phillips @ 2002-08-28 22:57 UTC (permalink / raw) To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm On Thursday 29 August 2002 00:39, Andrew Morton wrote: > Daniel Phillips wrote: > > > > ... > > So there's no question that the race is lurking in 2.4. I noticed several > > more paths besides the one above that look suspicious as well. The bottom > > line is, 2.4 needs a fix along the lines of my suggestion or Christian's, > > something that can actually be proved. > > > > It's a wonder that this problem manifests so rarely in practice. > > I sort-of glanced through the 2.4 paths and it appears that in all of the > places where it could do a page_cache_get/release, that would never happen > because of other parts of the page state. > > Like: it can't be in pagecache, so we won't run writepage, and > it can't have buffers, so we won't run try_to_release_page(). > > Of course, I might have missed a path. And, well, generally: ugh. I think it is happening. I just went sifting through the archives on 'oops' and '2.4'. The first one I found was a "2.4.18-xfs (xfs related?) oops report", which fits the description nicely. The race I showed actually causes the page->count to go negative, avoiding a double free on a technicality. That doesn't make me feel much better about it. Have you got a BUG_ON(!page_count(page)) in put_page_testzero? I think we might see some action. -- Daniel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
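For reference, the check Daniel proposes would be a one-liner. Roughly, as a sketch of a kernel fragment, assuming put_page_testzero() is a plain atomic decrement-and-test of page->count as in 2.5 of this era:

	/* Sketch only: assumes put_page_testzero() is an atomic
	 * dec-and-test of page->count. */
	static inline int put_page_testzero(struct page *page)
	{
		/* A put against a count of zero is exactly the footprint of
		 * the race above: the count goes negative instead of visibly
		 * double-freeing. */
		BUG_ON(page_count(page) == 0);
		return atomic_dec_and_test(&page->count);
	}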
* Re: MM patches against 2.5.31 2002-08-26 20:09 ` Daniel Phillips 2002-08-26 20:58 ` Christian Ehrhardt @ 2002-08-26 21:31 ` Andrew Morton 2002-08-27 3:42 ` Benjamin LaHaise 1 sibling, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-26 21:31 UTC (permalink / raw) To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm Daniel Phillips wrote: > > ... > > If you think about who is going to remove the page from the lru you'll > > see the problem. > > Nope, still don't see it. Whoever hits put_page_testzero frees the page, > secure in the knowledge that there are no other references to it. Sure. But this requires that the caller of page_cache_release() has previously removed the page from the LRU. We (used to) do that for truncate and page reclaim. But we did not do that for anon pages. For anon pages, we perform magical LRU removal when the page refcount goes to zero. The fact that we performed explicit removal in one place, and magical removal in the other was unfortunate. I nuked the explicit removal and made it all magical (explicit removal in truncate_complete_page() was wrong anyway - the page could have been rescued and anonymised by concurrent pagefault and must stay on the LRU). Possibly, we could go back to explicit removal everywhere. Haven't really looked at that, but I suspect we're back to a similar problem: how do you unracily determine whether the page should be removed from the LRU? Take ->page_table_lock and look at page_count(page)? Worried. I like the magical-removal-just-before-free, and my gut feel is that it'll provide a cleaner end result. Making presence on the LRU contribute to page->count is attractive, if only because it removes some irritating and expensive page_cache_gets and puts from shrink_cache and refill_inactive. But for it to be useful, we must perform explicit removal everywhere. Making presence on the LRU contribute to page->count doesn't fundamentally change anything of course - it offsets the current problems by one. Then again, it would remove all page_cache_gets/releases from vmscan.c and may thus make the race go away. That's a bit of a timebomb though. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
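In outline, the "presence on the LRU contributes to page->count" scheme could look like the following. This is a sketch, not real 2.5 code: the helper names are borrowed from the thread's usage and the per-zone locking is simplified.

	/* Sketch: the LRU list owns one reference to every page on it. */
	void lru_cache_add(struct page *page)
	{
		struct zone *zone = page_zone(page);

		page_cache_get(page);			/* the LRU's own reference */
		spin_lock_irq(&zone->lru_lock);
		add_page_to_inactive_list(zone, page);
		spin_unlock_irq(&zone->lru_lock);
	}

	/* In the reclaim scanner, under zone->lru_lock: a count of one
	 * means only the LRU references the page, so it is trivially
	 * freeable. */
	if (page_count(page) == 1) {
		del_page_from_lru(zone, page);
		if (put_page_testzero(page))		/* drop the LRU's reference */
			__free_pages_ok(page, 0);	/* the real free */
	}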
* Re: MM patches against 2.5.31 2002-08-26 21:31 ` Andrew Morton @ 2002-08-27 3:42 ` Benjamin LaHaise 2002-08-27 4:37 ` Andrew Morton 0 siblings, 1 reply; 37+ messages in thread From: Benjamin LaHaise @ 2002-08-27 3:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote: > I like the magical-removal-just-before-free, and my gut feel is that > it'll provide a cleaner end result. For the record, I'd rather see explicit removal everywhere. We received a number of complaints along the lines of "I run my app immediately after system startup, and it's fast, but the second time it's slower" due to the lazy page reclaim in early 2.4. Until there's a way to make LRU scanning faster than page allocation, it can't be lazy. -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-27 3:42 ` Benjamin LaHaise @ 2002-08-27 4:37 ` Andrew Morton 0 siblings, 0 replies; 37+ messages in thread From: Andrew Morton @ 2002-08-27 4:37 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm Benjamin LaHaise wrote: > > On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote: > > I like the magical-removal-just-before-free, and my gut feel is that > > it'll provide a cleaner end result. > > For the record, I'd rather see explicite removal everwhere. We received > a number of complaints along the lines of "I run my app immediately after > system startup, and it's fast, but the second time it's slower" due to > the lazy page reclaim in early 2.4. Until there's a way to make LRU > scanning faster than page allocation, it can't be lazy. > I think that's what Rik was referring to. But here, "explicit removal" refers to running lru_cache_del() prior to the final put_page, rather than within the context of the final put_page. So it's a different thing. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
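Spelled out as code, the two orderings being contrasted are roughly these (sketches using the thread's function names, not verbatim kernel code):

	/* "Explicit removal": the caller takes the page off the LRU first,
	 * then drops its reference. */
	lru_cache_del(page);
	page_cache_release(page);

	/* Removal within the context of the final put: the page comes off
	 * the LRU only after its count has already reached zero. */
	if (put_page_testzero(page)) {
		lru_cache_del(page);	/* page is count==0 but still on the LRU here */
		__free_pages_ok(page, 0);
	}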
* Re: MM patches against 2.5.31 2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton 2002-08-22 11:28 ` Christian Ehrhardt @ 2002-08-22 15:59 ` Steven Cole 2002-08-22 16:06 ` Martin J. Bligh 1 sibling, 1 reply; 37+ messages in thread From: Steven Cole @ 2002-08-22 15:59 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, linux-mm On Wed, 2002-08-21 at 20:29, Andrew Morton wrote: > I've uploaded a rollup of pending fixes and feature work > against 2.5.31 to > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/ > > The rolled up patch there is suitable for ongoing testing and > development. The individual patches are in the broken-out/ > directory and should all be documented. The good news: I ran my dbench 1..128 stress test and for the first time since 2.5.31-vanilla there were _no_ BUG()s reported at all. The other news, from dmesg:

	kjournald starting. Commit interval 5 seconds
	EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
	EXT3-fs: mounted filesystem with ordered data mode.
	kjournald: page allocation failure. order:0, mode:0x0

The kjournald failure message came out with dbench 48 running on an ext3 partition. The test continued with only this one instance of this message. Steven -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 15:59 ` Steven Cole @ 2002-08-22 16:06 ` Martin J. Bligh 2002-08-22 19:45 ` Steven Cole 2002-08-26 2:15 ` Andrew Morton 0 siblings, 2 replies; 37+ messages in thread From: Martin J. Bligh @ 2002-08-22 16:06 UTC (permalink / raw) To: Steven Cole, Andrew Morton; +Cc: lkml, linux-mm > kjournald: page allocation failure. order:0, mode:0x0 I've seen this before, but am curious how we ever passed a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere that does this? Thanks, M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 16:06 ` Martin J. Bligh @ 2002-08-22 19:45 ` Steven Cole 0 siblings, 0 replies; 37+ messages in thread From: Steven Cole @ 2002-08-22 19:45 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, lkml, linux-mm On Thu, 2002-08-22 at 10:06, Martin J. Bligh wrote: > > kjournald: page allocation failure. order:0, mode:0x0 > > I've seen this before, but am curious how we ever passed > a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > that does this? > > Thanks, > > M. I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing unusual happening, and then got this from pdflush with dbench 96:

	pdflush: page allocation failure. order:0, mode:0x0

FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT. Steven -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-22 16:06 ` Martin J. Bligh 2002-08-22 19:45 ` Steven Cole @ 2002-08-26 2:15 ` Andrew Morton 2002-08-26 2:08 ` Martin J. Bligh 1 sibling, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-26 2:15 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm "Martin J. Bligh" wrote: > > > kjournald: page allocation failure. order:0, mode:0x0 > > I've seen this before, but am curious how we ever passed > a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > that does this? Could be anywhere, really. A network interrupt doing GFP_ATOMIC while kjournald is executing. A radix-tree node allocation on the add-to-swap path perhaps. (The swapout failure messages aren't supposed to come out, but mempool_alloc() stomps on the caller's setting of PF_NOWARN.) Or:

	mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
	89

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 2:15 ` Andrew Morton @ 2002-08-26 2:08 ` Martin J. Bligh 2002-08-26 2:32 ` Andrew Morton 0 siblings, 1 reply; 37+ messages in thread From: Martin J. Bligh @ 2002-08-26 2:08 UTC (permalink / raw) To: Andrew Morton; +Cc: Steven Cole, lkml, linux-mm >> > kjournald: page allocation failure. order:0, mode:0x0 >> >> I've seen this before, but am curious how we ever passed >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere >> that does this? > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC > while kjournald is executing. A radix-tree node allocation > on the add-to-swap path perhaps. (The swapout failure messages > aren't supposed to come out, but mempool_alloc() stomps on the > caller's setting of PF_NOWARN.) > > Or: > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l > 89 No, GFP_ATOMIC is not 0:

	#define __GFP_HIGH	0x20	/* Should access emergency pools? */
	#define GFP_ATOMIC	(__GFP_HIGH)

Looking at all the options:

	#define __GFP_WAIT	0x10	/* Can wait and reschedule? */
	#define __GFP_HIGH	0x20	/* Should access emergency pools? */
	#define __GFP_IO	0x40	/* Can start low memory physical IO? */
	#define __GFP_HIGHIO	0x80	/* Can start high mem physical IO? */
	#define __GFP_FS	0x100	/* Can call down to low-level FS? */

What worries me is that 0 seems to mean "you can't do anything to try and free it, but you can't access the emergency pools either". Seems doomed to failure to me. And the standard sets we have are:

	#define GFP_NOHIGHIO	( __GFP_WAIT | __GFP_IO)
	#define GFP_NOIO	( __GFP_WAIT)
	#define GFP_NOFS	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
	#define GFP_ATOMIC	(__GFP_HIGH)
	#define GFP_USER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
	#define GFP_HIGHUSER	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
	#define GFP_KERNEL	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
	#define GFP_NFS	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
	#define GFP_KSWAPD	( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)

So I think someone's screwed something up, and this is accidental. Or I'm just totally misunderstanding this, which is perfectly possible. M. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 2:08 ` Martin J. Bligh @ 2002-08-26 2:32 ` Andrew Morton 2002-08-26 3:06 ` Steven Cole 0 siblings, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-26 2:32 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm "Martin J. Bligh" wrote: > > >> > kjournald: page allocation failure. order:0, mode:0x0 > >> > >> I've seen this before, but am curious how we ever passed > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > >> that does this? > > > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC > > while kjournald is executing. A radix-tree node allocation > > on the add-to-swap path perhaps. (The swapout failure messages > > aren't supposed to come out, but mempool_alloc() stomps on the > > caller's setting of PF_NOWARN.) > > > > Or: > > > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l > > 89 > > No, GFP_ATOMIC is not 0: > It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips __GFP_WAIT|__GFP_IO on the first attempt. It also disables the printk, so maybe I just dunno ;) show_stack() would tell. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
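The arithmetic checks out against the flag values Martin quoted: GFP_NOIO is just __GFP_WAIT, so stripping __GFP_WAIT|__GFP_IO leaves a mask of exactly zero. A standalone demonstration (flag values copied from the headers quoted above; the stripping behaviour is as Andrew describes it):

	#include <stdio.h>

	/* Flag values as quoted earlier in the thread. */
	#define __GFP_WAIT	0x10
	#define __GFP_IO	0x40
	#define GFP_NOIO	(__GFP_WAIT)

	int main(void)
	{
		/* mempool_alloc()'s first attempt strips the blocking flags
		 * so that it cannot sleep before falling back to the pool: */
		unsigned int first_try = GFP_NOIO & ~(__GFP_WAIT | __GFP_IO);

		/* Prints "mode:0x0" -- the mask seen in the failure reports. */
		printf("mode:0x%x\n", first_try);
		return 0;
	}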
* Re: MM patches against 2.5.31 2002-08-26 2:32 ` Andrew Morton @ 2002-08-26 3:06 ` Steven Cole 0 siblings, 0 replies; 37+ messages in thread From: Steven Cole @ 2002-08-26 3:06 UTC (permalink / raw) To: Andrew Morton; +Cc: Martin J. Bligh, lkml, linux-mm On Sun, 2002-08-25 at 20:32, Andrew Morton wrote: > "Martin J. Bligh" wrote: > > > > >> > kjournald: page allocation failure. order:0, mode:0x0 > > >> > > >> I've seen this before, but am curious how we ever passed > > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere > > >> that does this? > > > > > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC > > > while kjournald is executing. A radix-tree node allocation > > > on the add-to-swap path perhaps. (The swapout failure messages > > > aren't supposed to come out, but mempool_alloc() stomps on the > > > caller's setting of PF_NOWARN.) > > > > > > Or: > > > > > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l > > > 89 > > > > No, GFP_ATOMIC is not 0: > > > It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips > __GFP_WAIT|__GFP_IO on the first attempt. > > It also disables the printk, so maybe I just dunno ;) show_stack() > would tell. The "kjournald: page allocation failure. order:0, mode:0x0" message and "pdflush: page allocation failure. order:0, mode:0x0" occurred only once each on my dual p3 scsi ext3 test box running 2.5.31-mm1. So, I added something like this:

--- page_alloc.c.orig	Thu Aug 22 17:27:32 2002
+++ page_alloc.c	Thu Aug 22 17:29:24 2002
@@ -388,6 +388,8 @@
 			printk("%s: page allocation failure."
 				" order:%d, mode:0x%x\n",
 				current->comm, order, gfp_mask);
+			if (gfp_mask == 0)
+				BUG();
 	}
 	return NULL;
 }

and continued testing on Friday with no repeats of the "page allocation failure" messages. I obtained a second dual p3 ext3 test box (ide this time) and left both boxes running 2.5.31-mm1 and the dbench 1..128 stress test scripted to rerun many times over the weekend. Due to a couple of firewalls, I can't look at those boxes from here, but I'll let you know what happened in about 10 to 11 hours. Cheers, Steven -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31
@ 2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
0 siblings, 1 reply; 37+ messages in thread
From: Ed Tomlinson @ 2002-08-26 22:09 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: Andrew Morton, Christian Ehrhardt, Daniel Phillips
This seems to have been missed:
Linus Torvalds wrote:
> In article <3D6989F7.9ED1948A@zip.com.au>,
> Andrew Morton <akpm@zip.com.au> wrote:
>>
>>What I'm inclined to do there is to change __page_cache_release()
>>to not attempt to free the page at all. Just let it sit on the
>>LRU until page reclaim encounters it. With the anon-free-via-pagevec
>>patch, very, very, very few pages actually get their final release in
>>__page_cache_release() - zero on uniprocessor, I expect.
>
> If you do this, then I would personally suggest a conceptually different
> approach: make the LRU list count towards the page count. That will
> _automatically_ result in what you describe - if a page is on the LRU
> list, then "freeing" it will always just decrement the count, and the
> _real_ free comes from walking the LRU list and considering count==1 to
> be trivially freeable.
>
> That way you don't have to have separate functions for releasing
> different kinds of pages (we've seen how nasty that was from a
> maintenance standpoint already with the "put_page vs
> page_cache_release" thing).
>
> Ehh?
If every structure locks before removing its reference (i.e. before testing and/or
removing an LRU reference we take zone->lru_lock; for slabs we take cachep->spinlock;
etc.), this should work. It's a bit of an audit task to make sure the various locks
are taken (and documented), though.
By letting the actual free be lazy, as Linus suggests, things should simplify nicely.
comments,
Ed Tomlinson
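A sketch of the audit rule for the LRU case — lru_drop_reference() is a hypothetical helper, not an existing function; the point is the lock-before-unreference pattern, and each structure holding a reference would need its own equivalent:

	/*
	 * The LRU's reference is only ever dropped under zone->lru_lock,
	 * so a reclaim scan testing page_count(page) == 1 under the same
	 * lock cannot race with it.
	 */
	static void lru_drop_reference(struct page *page)
	{
		struct zone *zone = page_zone(page);
		int do_free = 0;

		spin_lock_irq(&zone->lru_lock);
		if (TestClearPageLRU(page)) {
			del_page_from_lru(zone, page);
			do_free = put_page_testzero(page);	/* the LRU's reference */
		}
		spin_unlock_irq(&zone->lru_lock);

		if (do_free)
			__free_pages_ok(page, 0);
	}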
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: MM patches against 2.5.31 2002-08-26 22:09 Ed Tomlinson @ 2002-08-26 23:58 ` Andrew Morton 2002-08-27 0:13 ` Rik van Riel 0 siblings, 1 reply; 37+ messages in thread From: Andrew Morton @ 2002-08-26 23:58 UTC (permalink / raw) To: Ed Tomlinson; +Cc: linux-mm, linux-kernel, Christian Ehrhardt, Daniel Phillips Ed Tomlinson wrote: > > This seems to have been missed: Still thinking about it. > Linus Torvalds wrote: > > > In article <3D6989F7.9ED1948A@zip.com.au>, > > Andrew Morton <akpm@zip.com.au> wrote: > >> > >>What I'm inclined to do there is to change __page_cache_release() > >>to not attempt to free the page at all. Just let it sit on the > >>LRU until page reclaim encounters it. With the anon-free-via-pagevec > >>patch, very, very, very few pages actually get their final release in > >>__page_cache_release() - zero on uniprocessor, I expect. > > > > If you do this, then I would personally suggest a conceptually different > > approach: make the LRU list count towards the page count. That will > > _automatically_ result in what you describe - if a page is on the LRU > > list, then "freeing" it will always just decrement the count, and the > > _real_ free comes from walking the LRU list and considering count==1 to > > be trivially freeable. > > > > That way you don't have to have separate functions for releasing > > different kinds of pages (we've seen how nasty that was from a > > maintenance standpoint already with the "put_page vs > > page_cache_release" thing). > > > > Ehh? > > If every structure locks before removing its reference (i.e. before testing and/or > removing an LRU reference we take zone->lru_lock; for slabs we take cachep->spinlock; > etc.), this should work. It's a bit of an audit task to make sure the various locks > are taken (and documented), though. > > By letting the actual free be lazy, as Linus suggests, things should simplify nicely. Well we wouldn't want to leave tons of free pages on the LRU - the VM would needlessly reclaim pagecache before finding the free pages. And higher-order page allocations could suffer. If we go for explicit lru removal in truncate and zap_pte_range then this approach may be best. Still thinking about it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: MM patches against 2.5.31 2002-08-26 23:58 ` Andrew Morton @ 2002-08-27 0:13 ` Rik van Riel 0 siblings, 0 replies; 37+ messages in thread From: Rik van Riel @ 2002-08-27 0:13 UTC (permalink / raw) To: Andrew Morton Cc: Ed Tomlinson, linux-mm, linux-kernel, Christian Ehrhardt, Daniel Phillips On Mon, 26 Aug 2002, Andrew Morton wrote: > Well we wouldn't want to leave tons of free pages on the LRU - the VM > would needlessly reclaim pagecache before finding the free pages. And > higher-order page allocations could suffer. We did this with the swap cache in 2.4.<7 and it was an absolute disaster. regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2002-08-28 22:57 UTC | newest]

Thread overview: 37+ messages:
2002-08-22  2:29 MM patches against 2.5.31 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
2002-08-26  1:52 ` Andrew Morton
2002-08-26  9:10 ` Christian Ehrhardt
2002-08-26 14:22 ` Daniel Phillips
2002-08-26 15:29 ` Christian Ehrhardt
2002-08-26 17:56 ` Daniel Phillips
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
2002-08-27  9:22 ` Christian Ehrhardt
2002-08-27 19:19 ` Andrew Morton
2002-08-26 20:00 ` Christian Ehrhardt
2002-08-26 20:09 ` Daniel Phillips
2002-08-26 20:58 ` Christian Ehrhardt
2002-08-27 16:48 ` Daniel Phillips
2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
2002-08-28 17:42 ` Andrew Morton
2002-08-28 20:41 ` Daniel Phillips
2002-08-28 21:03 ` Andrew Morton
2002-08-28 22:04 ` Daniel Phillips
2002-08-28 22:39 ` Andrew Morton
2002-08-28 22:57 ` Daniel Phillips
2002-08-26 21:31 ` Andrew Morton
2002-08-27  3:42 ` Benjamin LaHaise
2002-08-27  4:37 ` Andrew Morton
2002-08-22 15:59 ` Steven Cole
2002-08-22 16:06 ` Martin J. Bligh
2002-08-22 19:45 ` Steven Cole
2002-08-26  2:15 ` Andrew Morton
2002-08-26  2:08 ` Martin J. Bligh
2002-08-26  2:32 ` Andrew Morton
2002-08-26  3:06 ` Steven Cole
2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
2002-08-27  0:13 ` Rik van Riel