From: Rafael Aquini <aquini@redhat.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Alex Shi <alex.shi@linux.alibaba.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()
Date: Fri, 22 May 2020 13:08:44 -0400 [thread overview]
Message-ID: <20200522170844.GA85134@optiplex-fbsd> (raw)
In-Reply-To: <alpine.LSU.2.11.2005212246080.8458@eggly.anvils>
On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
>
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
>
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
>
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
>
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.
>
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
>
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
>
> mm/swap_state.c | 52 +++++++++++++++++++++++++---------------------
> 1 file changed, 29 insertions(+), 23 deletions(-)
>
> --- 5.7-rc6-mm1/mm/swap_state.c 2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c 2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
> return NULL;
>
> /*
> + * Get a new page to read into from swap. Allocate it now,
> + * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> + * cause any racers to loop around until we add it to cache.
> + */
> + page = alloc_page_vma(gfp_mask, vma, addr);
> + if (!page)
> + return NULL;
> +
> + /*
> * Swap entry may have been freed since our caller observed it.
> */
> err = swapcache_prepare(entry);
> if (!err)
> break;
>
> - if (err == -EEXIST) {
> - /*
> - * We might race against get_swap_page() and stumble
> - * across a SWAP_HAS_CACHE swap_map entry whose page
> - * has not been brought into the swapcache yet.
> - */
> - cond_resched();
> - continue;
> - }
> + put_page(page);
> + if (err != -EEXIST)
> + return NULL;
>
> - return NULL;
> + /*
> + * We might race against __delete_from_swap_cache(), and
> + * stumble across a swap_map entry whose SWAP_HAS_CACHE
> + * has not yet been cleared. Or race against another
> + * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> + * in swap_map, but not yet added its page to swap cache.
> + */
> + cond_resched();
> }
>
> /*
> - * The swap entry is ours to swap in. Prepare a new page.
> + * The swap entry is ours to swap in. Prepare the new page.
> */
>
> - page = alloc_page_vma(gfp_mask, vma, addr);
> - if (!page)
> - goto fail_free;
> -
> __SetPageLocked(page);
> __SetPageSwapBacked(page);
>
> /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> + put_swap_page(page, entry);
> goto fail_unlock;
> + }
>
> - if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> - goto fail_delete;
> + if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> + delete_from_swap_cache(page);
> + goto fail_unlock;
> + }
>
> - /* Initiate read into locked page */
> + /* Caller will initiate read into locked page */
> SetPageWorkingset(page);
> lru_cache_add_anon(page);
> *new_page_allocated = true;
> return page;
>
> -fail_delete:
> - delete_from_swap_cache(page);
> fail_unlock:
> unlock_page(page);
> put_page(page);
> -fail_free:
> - swap_free(entry);
> return NULL;
> }
>
>
Acked-by: Rafael Aquini <aquini@redhat.com>
next prev parent reply other threads:[~2020-05-22 17:08 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-22 5:56 Hugh Dickins
2020-05-22 17:08 ` Rafael Aquini [this message]
2020-05-23 0:24 ` Andrew Morton
2020-05-26 15:45 ` Johannes Weiner
2020-05-27 21:44 ` Hugh Dickins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200522170844.GA85134@optiplex-fbsd \
--to=aquini@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@linux.alibaba.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox