Subject: Re: [rfc][patch 2/2] mm: mlocked pages off LRU
From: Lee Schermerhorn
To: Nick Piggin
Cc: Christoph Lameter, Linux Memory Management List, Andrew Morton,
	Christoph Hellwig
In-Reply-To: <20070306143045.GA28629@wotan.suse.de>
References: <20070305161746.GD8128@wotan.suse.de>
	<20070306010529.GB23845@wotan.suse.de>
	<20070306014403.GD23845@wotan.suse.de>
	<20070306021307.GE23845@wotan.suse.de>
	<20070306025016.GA1912@wotan.suse.de>
	<20070306143045.GA28629@wotan.suse.de>
Date: Tue, 06 Mar 2007 17:23:55 -0500
Message-Id: <1173219835.20580.15.camel@localhost>
Content-Type: text/plain
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org

On Tue, 2007-03-06 at 15:30 +0100, Nick Piggin wrote:
> New core patch. This one is actually tested and works, and you can see
> the mlocked pages being accounted.
> 
> Same basic idea. Too many fixes and changes to list. Haven't taken up
> Christoph's idea to do a union in struct page, but it could be a followup.
> 
> Most importantly (aside from crashes and obvious bugs), it should correctly
> synchronise munlock vs vmscan lazy mlock now. Before this, it was possible
> to have pages leak. This took me a bit of thinking to get right, but was
> rather simple in the end.
> 
> Memory migration should work now, too, but not tested.
> 
> What do people think? Yes? No?

Nick:  I've grabbed your 2 patches in this series and rebased them to
21-rc2-mm2 so I can test them and compare with Christoph's [which I've
also rebased to -mm2].  I had to fix up ia32_setup_arg_pages() for ia64
to track the change you made with install_new_anon_page().  Patch
included below.  Some comments in-line below, as well.

Now builds, boots, and successfully builds a kernel with Christoph's
series.

Some basic testing with memtoy [see link below] shows pages being
locked according to the /proc/meminfo stats, but the counts don't
decrease when I unmap the segment or when the task exits.  I'll
investigate why and let you know how further testing goes.

After that, I plan to merge both series with my page migration series
and your page cache replication patch to test the effects there.
Should be "interesting".

If you're interested, I have a little tool/toy for testing mm stuff at:

	http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

I recently added lock()/unlock() commands for testing locking of
memory regions.  It could use more [a lot more] documentation, but it
does have a README and internal help.

Lee

> 
> --
> 
> Index: linux-2.6/mm/mlock.c
> ===================================================================
> --- linux-2.6.orig/mm/mlock.c
> +++ linux-2.6/mm/mlock.c
> @@ -8,17 +8,204 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  
> +#include "internal.h"
> +
> +#define page_mlock_count(page)		(*(unsigned long *)&(page)->lru.next)
> +#define set_page_mlock_count(page, v)	(page_mlock_count(page) = (v))
> +#define inc_page_mlock_count(page)	(page_mlock_count(page)++)
> +#define dec_page_mlock_count(page)	(page_mlock_count(page)--)
> +
> +/*
> + * A page's mlock_count is kept in page->lru.next as an unsigned long.
> + * Access to this count is serialised with the page lock (or, in the
> + * case of mlock_page, virtue that there are no other references to
> + * the page).
> + *
> + * mlock counts are incremented at mlock, mmap, mremap, and new anon page
> + * faults, and lazily via vmscan. Decremented at munlock, munmap, and exit.
> + * mlock is not inherited across fork or exec, so we're safe there.
> + *
> + * If PageMLock is set, then the page is removed from the LRU list, and
> + * has its refcount incremented. This increment prevents the page from being
> + * freed until the mlock_count is decremented to zero and PageMLock is cleared.
> + *
> + * When lazy incrementing via vmscan, it is important to ensure that the
> + * vma's VM_LOCKED status is not concurrently being modified, otherwise we
> + * may have elevated mlock_count of a page that is being munlocked. So lazy
> + * mlocked must take the mmap_sem for read, and verify that the vma really
> + * is locked (see mm/rmap.c).
> + */
> +
> +/*
> + * Marks a page, belonging to the given mlocked vma, as mlocked.
> + *
> + * The page must be either locked or new, and must not be on the LRU.
> + */
> +static void __set_page_mlock(struct page *page)
> +{
> +	BUG_ON(PageLRU(page));
> +	BUG_ON(PageMLock(page));
> +	/* BUG_ON(!list_empty(&page->lru)); -- if we always did list_del_init */
> +
> +	SetPageMLock(page);
> +	get_page(page);
> +	inc_zone_page_state(page, NR_MLOCK);
> +	set_page_mlock_count(page, 1);
> +}
> +
> +static void __clear_page_mlock(struct page *page)
> +{
> +	BUG_ON(!PageMLock(page));
> +	BUG_ON(PageLRU(page));
> +	BUG_ON(page_mlock_count(page));
> +
> +	dec_zone_page_state(page, NR_MLOCK);
> +	ClearPageMLock(page);
> +	lru_cache_add_active(page);
> +	put_page(page);
> +}
> +
> +/*
> + * Zero the page's mlock_count. This can be useful in a situation where
> + * we want to unconditionally remove a page from the pagecache.
> + *
> + * It is not illegal to call this function for any page, mlocked or not.

Maybe "It is legal ..." ???

> + * If called for a page that is still mapped by mlocked vmas, all we do
> + * is revert to lazy LRU behaviour -- semantics are not broken.
> + */
> +void clear_page_mlock(struct page *page)
> +{
> +	BUG_ON(!PageLocked(page));
> +
> +	if (likely(!PageMLock(page)))
> +		return;
> +	BUG_ON(!page_mlock_count(page));
> +	set_page_mlock_count(page, 0);
> +	__clear_page_mlock(page);
> +}
> +
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -91,6 +91,7 @@
>  #define PG_nosave_free		18	/* Used for system suspend/resume */
>  #define PG_buddy		19	/* Page is free, on buddy lists */
>  
> +#define PG_mlock		20	/* Page has mlocked vmas */

Conflicts with PG_readahead in 21-rc2-mm2.  I temporarily used bit
30--valid only for 64-bit systems.  [Same in Christoph's series.]
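Concretely, my temporary workaround in the rebase looks something like
the fragment below.  To be clear, this is just my rebase hack--not part
of either series--and the #error leg is only a placeholder; a real fix
needs a free low flag bit [or flag multiplexing] for 32-bit:

#if (BITS_PER_LONG > 32)
#define PG_mlock		30	/* Page has mlocked vmas; 64-bit only */
#else
#error "PG_mlock needs a free low page flag bit on 32-bit"
#endif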
> 
>  #if (BITS_PER_LONG > 32)
>  /*
> @@ -247,6 +248,10 @@ static inline void SetPageUptodate(struc
>  #define PageSwapCache(page)	0
>  #endif
>  
> +#define PageMLock(page)		test_bit(PG_mlock, &(page)->flags)
> +#define SetPageMLock(page)	set_bit(PG_mlock, &(page)->flags)
> +#define ClearPageMLock(page)	clear_bit(PG_mlock, &(page)->flags)
> +
>  #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
>  #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
>  #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -203,7 +203,8 @@ static void bad_page(struct page *page)
>  			1 << PG_slab |
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
> -			1 << PG_buddy );
> +			1 << PG_buddy |
> +			1 << PG_mlock );
>  	set_page_count(page, 0);
>  	reset_page_mapcount(page);
>  	page->mapping = NULL;
> @@ -438,7 +439,8 @@ static inline int free_pages_check(struc
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
>  			1 << PG_reserved |
> -			1 << PG_buddy ))))
> +			1 << PG_buddy |
> +			1 << PG_mlock ))))
>  		bad_page(page);
>  	if (PageDirty(page))
>  		__ClearPageDirty(page);
> @@ -588,7 +590,8 @@ static int prep_new_page(struct page *pa
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
>  			1 << PG_reserved |
> -			1 << PG_buddy ))))
> +			1 << PG_buddy |
> +			1 << PG_mlock ))))
>  		bad_page(page);
>  
>  	/*
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -297,44 +297,6 @@ int copy_strings_kernel(int argc,char **
>  EXPORT_SYMBOL(copy_strings_kernel);
>  
>  #ifdef CONFIG_MMU
> -/*
> - * This routine is used to map in a page into an address space: needed by
> - * execve() for the initial stack and environment pages.
> - *
> - * vma->vm_mm->mmap_sem is held for writing.
> - */
> -void install_arg_page(struct vm_area_struct *vma,
> -			struct page *page, unsigned long address)
> -{
> -	struct mm_struct *mm = vma->vm_mm;
> -	pte_t * pte;
> -	spinlock_t *ptl;
> -
> -	if (unlikely(anon_vma_prepare(vma)))
> -		goto out;
> -
> -	flush_dcache_page(page);
> -	pte = get_locked_pte(mm, address, &ptl);
> -	if (!pte)
> -		goto out;
> -	if (!pte_none(*pte)) {
> -		pte_unmap_unlock(pte, ptl);
> -		goto out;
> -	}
> -	inc_mm_counter(mm, anon_rss);
> -	lru_cache_add_active(page);
> -	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
> -			page, vma->vm_page_prot))));
> -	page_add_new_anon_rmap(page, vma, address);
> -	pte_unmap_unlock(pte, ptl);
> -
> -	/* no need for flush_tlb */
> -	return;
> -out:
> -	__free_page(page);
> -	force_sig(SIGKILL, current);
> -}
> -
>  #define EXTRA_STACK_VM_PAGES	20	/* random */
>  
>  int setup_arg_pages(struct linux_binprm *bprm,
> @@ -438,17 +400,25 @@ int setup_arg_pages(struct linux_binprm
>  		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
>  	}
>  
> +	ret = 0;
>  	for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
>  		struct page *page = bprm->page[i];
>  		if (page) {
>  			bprm->page[i] = NULL;
> -			install_arg_page(mpnt, page, stack_base);
> +			if (!ret)
> +				ret = install_new_anon_page(mpnt, page,
> +						stack_base);
> +			if (ret)
> +				put_page(page);

Need similar mod in arch/ia64/ia32/binfmt_elf32.c:ia32_setup_arg_pages().
Patch included below.
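For anyone reading along without patch 1/2 handy:  install_new_anon_page()
presumably looks roughly like the old install_arg_page() with an error
return instead of force_sig(), plus the off-LRU handling for VM_LOCKED
vmas.  The sketch below is my reconstruction from the call sites--the
error codes and the visibility of __set_page_mlock() are guesses, not
Nick's actual code:

int install_new_anon_page(struct vm_area_struct *vma, struct page *page,
			unsigned long address)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *pte;
	spinlock_t *ptl;

	if (unlikely(anon_vma_prepare(vma)))
		return -ENOMEM;		/* guess: callers put_page() on error */

	flush_dcache_page(page);
	pte = get_locked_pte(mm, address, &ptl);
	if (!pte)
		return -ENOMEM;
	if (!pte_none(*pte)) {
		pte_unmap_unlock(pte, ptl);
		return -EBUSY;
	}
	inc_mm_counter(mm, anon_rss);
	/* assumes patch 1/2 makes a helper like __set_page_mlock() visible */
	if (vma->vm_flags & VM_LOCKED)
		__set_page_mlock(page);	/* new page: off LRU, mlock_count = 1 */
	else
		lru_cache_add_active(page);
	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
			page, vma->vm_page_prot))));
	page_add_new_anon_rmap(page, vma, address);
	pte_unmap_unlock(pte, ptl);
	return 0;
}

Corrections welcome if the real thing differs.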
>  		}
>  		stack_base += PAGE_SIZE;
>  	}
>  	up_write(&mm->mmap_sem);
> -
> -	return 0;
> +
> +	if (ret)
> +		do_munmap(mm, mpnt->vm_start, mpnt->vm_start - mpnt->vm_end);
> +
> +	return ret;
>  }
>  
>  EXPORT_SYMBOL(setup_arg_pages);
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -272,6 +272,8 @@ static int migrate_page_move_mapping(str
>  		return 0;
>  	}
>  
> +	clear_page_mlock(page);
> +
>  	write_lock_irq(&mapping->tree_lock);
>  
>  	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> @@ -775,6 +777,17 @@ static int do_move_pages(struct mm_struc
>  				!migrate_all)
>  			goto put_and_set;
>  
> +		/*
> +		 * Just do the simple thing and put back mlocked pages onto
> +		 * the LRU list so they can be taken off again (inefficient
> +		 * but not a big deal).
> +		 */
> +		if (PageMLock(page)) {
> +			lock_page(page);
> +			clear_page_mlock(page);

Note that this will put the page into the lru pagevec cache
[__clear_page_mlock() above], where isolate_lru_page(), called from
migrate_page_add(), is unlikely to find it.  do_move_pages() has
already called migrate_prep() to drain the lru caches so that it is
more likely to find the pages, as does check_range() when called to
collect pages for migration.

Yes, this is already racy--the target task or other threads therein
can fault additional pages into the lru cache after the call to
migrate_prep().  But this almost guarantees that we'll miss the last
~PAGEVEC_SIZE pages.

> +			unlock_page(page);
> +		}
> +
>  		err = isolate_lru_page(page);
>  		if (err) {
>  put_and_set:
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c
> +++ linux-2.6/mm/mempolicy.c
> @@ -89,6 +89,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -224,7 +225,10 @@ static int check_pte_range(struct vm_are
>  	pte_t *orig_pte;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	struct page *mlocked;
>  
> +resume:
> +	mlocked = NULL;
>  	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	do {
>  		struct page *page;
> @@ -254,12 +258,24 @@ static int check_pte_range(struct vm_are
>  
>  		if (flags & MPOL_MF_STATS)
>  			gather_stats(page, private, pte_dirty(*pte));
> -		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> +			if (PageMLock(page) && !mlocked) {
> +				mlocked = page;
> +				break;
> +			}
>  			migrate_page_add(page, private, flags);
> -		else
> +		} else
>  			break;
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap_unlock(orig_pte, ptl);
> +
> +	if (mlocked) {
> +		lock_page(mlocked);
> +		clear_page_mlock(mlocked);

Same comment as for do_move_pages() above.

> +		unlock_page(mlocked);
> +		goto resume;
> +	}
> +
>  	return addr != end;
>  }
>  
> @@ -372,6 +388,7 @@ check_range(struct mm_struct *mm, unsign
>  		endvma = end;
>  		if (vma->vm_start > start)
>  			start = vma->vm_start;
> +
>  		err = check_pgd_range(vma, start, endvma, nodes,
>  				flags, private);
>  		if (err) {

Here's the patch mentioned above:

Replace the call to install_arg_page() in ia64's ia32 version of
setup_arg_pages() so that 21-rc2-mm2 builds on ia64 with Nick's
"mlocked pages off LRU" patch.
Signed-off-by: Lee Schermerhorn

 arch/ia64/ia32/binfmt_elf32.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: Linux/arch/ia64/ia32/binfmt_elf32.c
===================================================================
--- Linux.orig/arch/ia64/ia32/binfmt_elf32.c	2007-03-06 12:16:33.000000000 -0500
+++ Linux/arch/ia64/ia32/binfmt_elf32.c	2007-03-06 15:19:02.000000000 -0500
@@ -240,7 +240,11 @@ ia32_setup_arg_pages (struct linux_binpr
 		struct page *page = bprm->page[i];
 		if (page) {
 			bprm->page[i] = NULL;
-			install_arg_page(mpnt, page, stack_base);
+			if (!ret)
+				ret = install_new_anon_page(mpnt, page,
+						stack_base);
+			if (ret)
+				put_page(page);
 		}
 		stack_base += PAGE_SIZE;
 	}
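P.S.  Re my pagevec comment at the do_move_pages() hunk above:  the
simplest band-aid I can think of--an untested suggestion on my part,
not something in Nick's patch--would be to drain the per-cpu lru
pagevecs again after putting the mlocked page back:

	if (PageMLock(page)) {
		lock_page(page);
		clear_page_mlock(page);	/* re-adds via lru_cache_add_active() */
		unlock_page(page);
		/* flush pagevecs so isolate_lru_page() below can find it */
		lru_add_drain_all();
	}

Calling lru_add_drain_all() per page is heavy-handed; batching one
drain after the collection pass would be better.  But it illustrates
the hole.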