Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Robin Holt <holt@sgi.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>, Nick Piggin <nickpiggin@yahoo.com.au>,
	Ingo Molnar <mingo@elte.hu>, Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>, Rik van Riel <riel@redhat.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Tue, 24 Jun 2008 10:19:08 -0500	[thread overview]
Message-ID: <20080624151908.GO10062@sgi.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806232134330.19691@blonde.site>

OK.  I just gave it a try and I still get a failure with this patch
applied.

I can't help wondering if we (XPMEM, IB, etc) shouldn't be setting a
page flag indicating that the attempt to swap this page out should not
even be tried.  I would guess this has already been discussed to death.
If so, I am sorry, but I missed those discussions.

Thanks,
Robin

On Mon, Jun 23, 2008 at 09:58:42PM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Robin Holt wrote:
> > On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
> > 
> > > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > > a lock_page approximation for you to try?  Would you still be able to
> > > try mine when I'm ready, or does it now appear irrelevant to you?
> > 
> > Before your response, I had convinced myself my problem was specific to
> > XPMEM, but I see your point and may agree that it is a problem for all
> > get_user_pages() users.
> > 
> > I can certainly test when you have it ready.
> 
> Thanks a lot, Robin.  Here it is below.
> 
> > 
> > I had confused myself about Nick's first patch.  I will give that
> > another look over and see if it fixes the problem.
> 
> Nick's _first_ patch?  The one I was thinking of was the one
> with lock_page in do_wp_page, but it shouldn't be necessary now if
> what's below works - though his is a much smaller and less worrying
> patch, so anyone looking for a quick fix to the issue might well
> prefer his.
> 
> > 
> > > 	http://lkml.org/lkml/2006/9/14/384
> > > 
> > > but it's a broken thread, with misunderstanding on all sides,
> > > so rather hard to get a grasp of it.
> > 
> > That is extremely similar to the issue I am seeing.  I think that if
> > Infiniband were using the mmu_notifier stuff, they would be closer, but
> > IIRC, there are significant hardware restrictions which prevent demand
> > paging for working on some IB devices.
> 
> Ah, I'm glad you've managed to glean something from it, good.
> 
> Here's the rollup of the patches I'm proposing for two issues:
> it doesn't get my signoff yet because I'll want to split it into
> little stages properly commented, and I'll want to do more strenuous
> testing; but this shouldn't blow up in your face.  Against 2.6.26-rc7,
> should apply quite easily to earlier, but a little more work against
> 2.6.26-rc5-mm3 - though it simplifies some of that too (Rik+Lee Cced).
> 
> I say two issues, two competing issues.  One is the issue which may be
> your issue, that we want to decide COW in do_wp_page without needing
> PageLocked, so a concurrent shrink_page_list() doesn't occasionally
> force an unwanted COW, messing up some get_user_pages() uses.  The
> other, of no interest to you, is that we do want PageLocked when
> deciding COW in do_wp_page, because that's a good moment to free
> up the swap space - leaving the modified page with swap just makes
> a big seek likely when it's next written to swap.
> 
> Hugh
> 
>  include/linux/swap.h |   21 ++++--
>  mm/memory.c          |   15 +++-
>  mm/migrate.c         |    5 -
>  mm/page_io.c         |    2 
>  mm/swap_state.c      |   12 ++-
>  mm/swapfile.c        |  128 ++++++++++++++++++++++-------------------
>  6 files changed, 105 insertions(+), 78 deletions(-)
> 
> --- 2.6.26-rc7/include/linux/swap.h	2008-05-03 21:55:11.000000000 +0100
> +++ linux/include/linux/swap.h	2008-06-23 18:12:47.000000000 +0100
> @@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
>  extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
> -extern int can_share_swap_page(struct page *);
> -extern int remove_exclusive_swap_page(struct page *);
> +extern int can_reuse_swap_page_unlocked(struct page *);
> +extern int can_reuse_swap_page_locked(struct page *);
> +extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  
>  extern spinlock_t swap_lock;
> @@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
>  	return NULL;
>  }
>  
> -#define can_share_swap_page(p)			(page_mapcount(p) == 1)
> -
>  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  							gfp_t gfp_mask)
>  {
> @@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
>  
>  #define swap_token_default_timeout		0
>  
> -static inline int remove_exclusive_swap_page(struct page *p)
> +static inline int can_reuse_swap_page_unlocked(struct page *page)
>  {
> -	return 0;
> +	return 0;	/* irrelevant: never called  */
> +}
> +
> +static inline int can_reuse_swap_page_locked(struct page *page)
> +{
> +	return 0;	/* irrelevant: never called */
> +}
> +
> +static inline int try_to_free_swap(struct page *page)
> +{
> +	return 0;	/* irrelevant: never called */
>  }
>  
>  static inline swp_entry_t get_swap_page(void)
> --- 2.6.26-rc7/mm/memory.c	2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/memory.c	2008-06-23 18:12:47.000000000 +0100
> @@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
>  	 * not dirty accountable.
>  	 */
>  	if (PageAnon(old_page)) {
> -		if (!TestSetPageLocked(old_page)) {
> -			reuse = can_share_swap_page(old_page);
> -			unlock_page(old_page);
> +		if (page_mapcount(old_page) == 1) {
> +			if (!PageSwapCache(old_page))
> +				reuse = 1;
> +			else if (!TestSetPageLocked(old_page)) {
> +				reuse = can_reuse_swap_page_locked(old_page);
> +				unlock_page(old_page);
> +			} else
> +				reuse = can_reuse_swap_page_unlocked(old_page);
>  		}
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
> @@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
>  
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if (write_access && can_share_swap_page(page)) {
> +	if (write_access && can_reuse_swap_page_locked(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		write_access = 0;
>  	}
> @@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
>  
>  	swap_free(entry);
>  	if (vm_swap_full())
> -		remove_exclusive_swap_page(page);
> +		try_to_free_swap(page);
>  	unlock_page(page);
>  
>  	if (write_access) {
> --- 2.6.26-rc7/mm/migrate.c	2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/migrate.c	2008-06-23 18:12:47.000000000 +0100
> @@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
>  	get_page(newpage);	/* add cache reference */
>  #ifdef CONFIG_SWAP
>  	if (PageSwapCache(page)) {
> -		SetPageSwapCache(newpage);
>  		set_page_private(newpage, page_private(page));
> +		/* page_swapcount() relies on private whenever PageSwapCache */
> +		smp_wmb();
> +		SetPageSwapCache(newpage);
>  	}
>  #endif
>  
> @@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
>  #endif
>  	ClearPageActive(page);
>  	ClearPagePrivate(page);
> -	set_page_private(page, 0);
>  	page->mapping = NULL;
>  
>  	/*
> --- 2.6.26-rc7/mm/page_io.c	2008-04-17 03:49:44.000000000 +0100
> +++ linux/mm/page_io.c	2008-06-23 18:12:47.000000000 +0100
> @@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
>  	struct bio *bio;
>  	int ret = 0, rw = WRITE;
>  
> -	if (remove_exclusive_swap_page(page)) {
> +	if (try_to_free_swap(page)) {
>  		unlock_page(page);
>  		goto out;
>  	}
> --- 2.6.26-rc7/mm/swap_state.c	2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swap_state.c	2008-06-23 18:12:47.000000000 +0100
> @@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
>  	BUG_ON(PagePrivate(page));
>  	error = radix_tree_preload(gfp_mask);
>  	if (!error) {
> +		set_page_private(page, entry.val);
> +		/* page_swapcount() relies on private whenever PageSwapCache */
> +		smp_wmb();
>  		write_lock_irq(&swapper_space.tree_lock);
>  		error = radix_tree_insert(&swapper_space.page_tree,
>  						entry.val, page);
>  		if (!error) {
>  			page_cache_get(page);
>  			SetPageSwapCache(page);
> -			set_page_private(page, entry.val);
>  			total_swapcache_pages++;
>  			__inc_zone_page_state(page, NR_FILE_PAGES);
>  			INC_CACHE_INFO(add_total);
> @@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
>  	BUG_ON(PagePrivate(page));
>  
>  	radix_tree_delete(&swapper_space.page_tree, page_private(page));
> -	set_page_private(page, 0);
>  	ClearPageSwapCache(page);
>  	total_swapcache_pages--;
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
> @@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page 
>   * 
>   * Its ok to check for PageSwapCache without the page lock
>   * here because we are going to recheck again inside 
> - * exclusive_swap_page() _with_ the lock. 
> + * try_to_free_swap() _with_ the lock. 
>   * 					- Marcelo
>   */
>  static inline void free_swap_cache(struct page *page)
>  {
> -	if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> -		remove_exclusive_swap_page(page);
> +	if (PageSwapCache(page) && !page_mapped(page) &&
> +	    !TestSetPageLocked(page)) {
> +		try_to_free_swap(page);
>  		unlock_page(page);
>  	}
>  }
> --- 2.6.26-rc7/mm/swapfile.c	2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swapfile.c	2008-06-23 18:12:47.000000000 +0100
> @@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
>  		goto bad_offset;
>  	if (!p->swap_map[offset])
>  		goto bad_free;
> -	spin_lock(&swap_lock);
>  	return p;
>  
>  bad_free:
> @@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
>  
>  	p = swap_info_get(entry);
>  	if (p) {
> +		spin_lock(&swap_lock);
>  		swap_entry_free(p, swp_offset(entry));
>  		spin_unlock(&swap_lock);
>  	}
>  }
>  
>  /*
> - * How many references to page are currently swapped out?
> + * How many page table references are there to this page's swap entry?
> + * including references to the page itself if add_mapcount.
> + *
> + * page_swapcount is only called on a page which was recently PageSwapCache
> + * (but might have been deleted from swap cache just before getting here).
> + *
> + * When called with PageLocked, the swapcount is stable, and the mapcount
> + * cannot rise (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * When called without PageLocked, the swapcount is stable while we hold
> + * swap_lock, and the mapcount cannot rise from 1 while we hold that page
> + * table lock (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
> + * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
> + * so in general, swapcount+mapcount should never be seen too low -
> + * but you need to consider more memory barriers if extending its use.
>   */
> -static inline int page_swapcount(struct page *page)
> +static int page_swapcount(struct page *page, int add_mapcount)
>  {
>  	int count = 0;
>  	struct swap_info_struct *p;
>  	swp_entry_t entry;
>  
> -	entry.val = page_private(page);
> -	p = swap_info_get(entry);
> -	if (p) {
> -		/* Subtract the 1 for the swap cache itself */
> -		count = p->swap_map[swp_offset(entry)] - 1;
> -		spin_unlock(&swap_lock);
> +	spin_lock(&swap_lock);
> +	if (add_mapcount)
> +		count = page_mapcount(page);
> +	if (PageSwapCache(page)) {
> +		/* We can rely on page_private once PageSwapCache is visible */
> +		smp_rmb();
> +		entry.val = page_private(page);
> +		p = swap_info_get(entry);
> +		if (p) {
> +			/* Subtract the 1 for the swap cache itself */
> +			count += p->swap_map[swp_offset(entry)] - 1;
> +		}
>  	}
> +	spin_unlock(&swap_lock);
>  	return count;
>  }
>  
>  /*
> - * We can use this swap cache entry directly
> - * if there are no other references to it.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * But something else, probably shrink_page_list(), already has PageLocked.
> + * Don't be misled to COW the page unnecessarily: check swapcount+mapcount.
>   */
> -int can_share_swap_page(struct page *page)
> +int can_reuse_swap_page_unlocked(struct page *page)
>  {
> -	int count;
> -
> -	BUG_ON(!PageLocked(page));
> -	count = page_mapcount(page);
> -	if (count <= 1 && PageSwapCache(page))
> -		count += page_swapcount(page);
> -	return count == 1;
> +	return page_swapcount(page, 1) == 1;
>  }
>  
>  /*
> - * Work out if there are any other processes sharing this
> - * swap cache page. Free it if you can. Return success.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * having acquiring PageLocked.  In this case, since we have that lock and
> + * are about to modify the page, we'd better free its swap space - it won't
> + * be read again, and writing there later would probably require an extra seek.
>   */
> -int remove_exclusive_swap_page(struct page *page)
> +int can_reuse_swap_page_locked(struct page *page)
>  {
> -	int retval;
> -	struct swap_info_struct * p;
> -	swp_entry_t entry;
> +	VM_BUG_ON(!PageLocked(page));
> +	if (page_swapcount(page, 1) != 1)
> +		return 0;
> +	if (!PageSwapCache(page))
> +		return 1;
> +	if (PageWriteback(page))
> +		return 1;
>  
> -	BUG_ON(PagePrivate(page));
> -	BUG_ON(!PageLocked(page));
> +	delete_from_swap_cache(page);
> +	SetPageDirty(page);
> +	return 1;
> +}
>  
> +/*
> + * If swap is getting full, or if there are no more mappings of this page,
> + * then try_to_free_swap is called to free its swap space.
> + */
> +int try_to_free_swap(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
>  	if (!PageSwapCache(page))
>  		return 0;
>  	if (PageWriteback(page))
>  		return 0;
> -	if (page_count(page) != 2) /* 2: us + cache */
> +	if (page_swapcount(page, 0))	/* Here we don't care about mapcount */
>  		return 0;
>  
> -	entry.val = page_private(page);
> -	p = swap_info_get(entry);
> -	if (!p)
> -		return 0;
> -
> -	/* Is the only swap cache user the cache itself? */
> -	retval = 0;
> -	if (p->swap_map[swp_offset(entry)] == 1) {
> -		/* Recheck the page count with the swapcache lock held.. */
> -		write_lock_irq(&swapper_space.tree_lock);
> -		if ((page_count(page) == 2) && !PageWriteback(page)) {
> -			__delete_from_swap_cache(page);
> -			SetPageDirty(page);
> -			retval = 1;
> -		}
> -		write_unlock_irq(&swapper_space.tree_lock);
> -	}
> -	spin_unlock(&swap_lock);
> -
> -	if (retval) {
> -		swap_free(entry);
> -		page_cache_release(page);
> -	}
> -
> -	return retval;
> +	delete_from_swap_cache(page);
> +	SetPageDirty(page);
> +	return 1;
>  }
>  
>  /*
> @@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
>  
>  	p = swap_info_get(entry);
>  	if (p) {
> +		spin_lock(&swap_lock);
>  		if (swap_entry_free(p, swp_offset(entry)) == 1) {
>  			page = find_get_page(&swapper_space, entry.val);
>  			if (page && unlikely(TestSetPageLocked(page))) {
> @@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
>  		spin_unlock(&swap_lock);
>  	}
>  	if (page) {
> -		int one_user;
> -
> -		BUG_ON(PagePrivate(page));
> -		one_user = (page_count(page) == 2);
> -		/* Only cache user (+us), or swap space full? Free it! */
> +		/* Not mapped elsewhere, or swap space full? Free it! */
>  		/* Also recheck PageSwapCache after page is locked (above) */
>  		if (PageSwapCache(page) && !PageWriteback(page) &&
> -					(one_user || vm_swap_full())) {
> +				(!page_mapped(page) || vm_swap_full())) {
>  			delete_from_swap_cache(page);
>  			SetPageDirty(page);
>  		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-06-24 15:19 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-18 16:41 Robin Holt
2008-06-18 17:29 ` Nick Piggin
2008-06-18 19:01   ` Hugh Dickins
2008-06-18 20:33     ` Robin Holt
2008-06-18 21:46       ` Hugh Dickins
2008-06-19  3:31         ` Nick Piggin
2008-06-19  3:34           ` Nick Piggin
2008-06-19 11:39           ` Hugh Dickins
2008-06-19 12:07             ` Nick Piggin
2008-06-19 12:21               ` Nick Piggin
2008-06-19 17:48                 ` Christoph Lameter
2008-06-19 12:34               ` Hugh Dickins
2008-06-19 12:53                 ` Nick Piggin
2008-06-19 13:25                   ` Hugh Dickins
2008-06-19 13:35                     ` Robin Holt
2008-06-19 16:32         ` Robin Holt
2008-06-20  9:23           ` Nick Piggin
2008-06-19  3:07     ` Nick Piggin
2008-06-19 11:09       ` Hugh Dickins
2008-06-19 13:38         ` Robin Holt
2008-06-19 13:49           ` Hugh Dickins
2008-06-23 15:54             ` Robin Holt
2008-06-23 16:48               ` Hugh Dickins
2008-06-23 17:52                 ` Robin Holt
2008-06-23 20:58                   ` Hugh Dickins
2008-06-24 11:56                     ` Robin Holt
2008-06-24 15:19                     ` Robin Holt [this message]
2008-06-24 20:19                       ` Hugh Dickins
2008-06-23 19:11             ` Robin Holt
2008-06-23 19:12               ` Robin Holt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080624151908.GO10062@sgi.com \
    --to=holt@sgi.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=clameter@sgi.com \
    --cc=hugh@veritas.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=riel@redhat.com \
    --cc=steiner@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox