linux-mm.kvack.org archive mirror
From: Rafael Aquini <aquini@redhat.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Alex Shi <alex.shi@linux.alibaba.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()
Date: Fri, 22 May 2020 13:08:44 -0400
Message-ID: <20200522170844.GA85134@optiplex-fbsd>
In-Reply-To: <alpine.LSU.2.11.2005212246080.8458@eggly.anvils>

On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
> 
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness.  We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
> 
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
> 
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
> 
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes to look at; but could be rediffed for
> folding into blah-instantiation.patch later.
> 
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
> 
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
> 
>  mm/swap_state.c |   52 +++++++++++++++++++++++++---------------------
>  1 file changed, 29 insertions(+), 23 deletions(-)
> 
> --- 5.7-rc6-mm1/mm/swap_state.c	2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c	2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
>  			return NULL;
>  
>  		/*
> +		 * Get a new page to read into from swap.  Allocate it now,
> +		 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> +		 * cause any racers to loop around until we add it to cache.
> +		 */
> +		page = alloc_page_vma(gfp_mask, vma, addr);
> +		if (!page)
> +			return NULL;
> +
> +		/*
>  		 * Swap entry may have been freed since our caller observed it.
>  		 */
>  		err = swapcache_prepare(entry);
>  		if (!err)
>  			break;
>  
> -		if (err == -EEXIST) {
> -			/*
> -			 * We might race against get_swap_page() and stumble
> -			 * across a SWAP_HAS_CACHE swap_map entry whose page
> -			 * has not been brought into the swapcache yet.
> -			 */
> -			cond_resched();
> -			continue;
> -		}
> +		put_page(page);
> +		if (err != -EEXIST)
> +			return NULL;
>  
> -		return NULL;
> +		/*
> +		 * We might race against __delete_from_swap_cache(), and
> +		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
> +		 * has not yet been cleared.  Or race against another
> +		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> +		 * in swap_map, but not yet added its page to swap cache.
> +		 */
> +		cond_resched();
>  	}
>  
>  	/*
> -	 * The swap entry is ours to swap in. Prepare a new page.
> +	 * The swap entry is ours to swap in. Prepare the new page.
>  	 */
>  
> -	page = alloc_page_vma(gfp_mask, vma, addr);
> -	if (!page)
> -		goto fail_free;
> -
>  	__SetPageLocked(page);
>  	__SetPageSwapBacked(page);
>  
>  	/* May fail (-ENOMEM) if XArray node allocation failed. */
> -	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> +	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> +		put_swap_page(page, entry);
>  		goto fail_unlock;
> +	}
>  
> -	if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> -		goto fail_delete;
> +	if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> +		delete_from_swap_cache(page);
> +		goto fail_unlock;
> +	}
>  
> -	/* Initiate read into locked page */
> +	/* Caller will initiate read into locked page */
>  	SetPageWorkingset(page);
>  	lru_cache_add_anon(page);
>  	*new_page_allocated = true;
>  	return page;
>  
> -fail_delete:
> -	delete_from_swap_cache(page);
>  fail_unlock:
>  	unlock_page(page);
>  	put_page(page);
> -fail_free:
> -	swap_free(entry);
>  	return NULL;
>  }
>  
> 
Acked-by: Rafael Aquini <aquini@redhat.com>
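
For readers following the ordering argument in the commit message, here
is a minimal userspace sketch of the "allocate first, then claim the
slot" pattern. It is not the kernel code: toy_slot, toy_slot_init(),
toy_prepare() and toy_read_async() are hypothetical names invented for
illustration, malloc() stands in for alloc_page_vma(), and an atomic
flag stands in for SWAP_HAS_CACHE in swap_map. The point it tries to
show is that the step which may block (allocation) happens before the
flag is set, so a claimant that holds the flag goes on to publish its
buffer without sleeping in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one swap_map slot; not the kernel's layout. */
struct toy_slot {
	atomic_bool has_cache;	/* models SWAP_HAS_CACHE */
	void *_Atomic cached;	/* models the page once it is in swap cache */
};

static void toy_slot_init(struct toy_slot *slot)
{
	atomic_init(&slot->has_cache, false);
	atomic_init(&slot->cached, NULL);
}

/* Models swapcache_prepare(): 0 on success, -1 if already claimed (-EEXIST). */
static int toy_prepare(struct toy_slot *slot)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&slot->has_cache,
					      &expected, true) ? 0 : -1;
}

/*
 * Returns the buffer that is in the toy cache, or NULL if allocation
 * failed. *allocated is set to true only when this call created it.
 */
static void *toy_read_async(struct toy_slot *slot, size_t size, bool *allocated)
{
	*allocated = false;

	for (;;) {
		/* Models the swap cache lookup at the top of the loop. */
		void *found = atomic_load(&slot->cached);
		if (found)
			return found;

		/* Allocate before claiming: this may block, but no flag is held. */
		void *buf = malloc(size);
		if (!buf)
			return NULL;

		if (toy_prepare(slot) == 0) {
			/* We own the slot: publish without blocking in between. */
			assert(atomic_load(&slot->cached) == NULL);
			atomic_store(&slot->cached, buf);
			*allocated = true;
			return buf;
		}

		/* Lost the race: free and retry (the kernel calls cond_resched()). */
		free(buf);
	}
}
```

In this toy model the flag is never cleared, so a second call on the
same slot simply returns the buffer published by the first. The
allocate-and-free on every losing iteration is the cost Hugh mentions
accepting in exchange for the simpler loop structure.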



Thread overview: 5+ messages
2020-05-22  5:56 Hugh Dickins
2020-05-22 17:08 ` Rafael Aquini [this message]
2020-05-23  0:24 ` Andrew Morton
2020-05-26 15:45 ` Johannes Weiner
2020-05-27 21:44   ` Hugh Dickins
