From: Vineeth Pillai <vpillai@digitalocean.com>
To: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Huang Ying <ying.huang@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Kelley Nielsen <kelleynnn@gmail.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
Date: Tue, 1 Jan 2019 13:24:15 -0500
Message-ID: <CANaguZAStuiXpk2S0rYwdn3Zzsoakavaps4RzSRVqMs3wZ49qg@mail.gmail.com>
In-Reply-To: <alpine.LSU.2.11.1812311635590.4106@eggly.anvils>
Thanks a lot for the fixes and the detailed explanation, Hugh! I shall fold
all the changes from you and Huang into the next iteration.
Thanks for all the suggestions and comments as well. I am looking into
them and will include the changes in the next version, and will discuss
over mail if anything needs clarification.
Thanks again!
~Vineeth
On Mon, Dec 31, 2018 at 7:44 PM Hugh Dickins <hughd@google.com> wrote:
> On Mon, 3 Dec 2018, Vineeth Remanan Pillai wrote:
>
> > This patch was initially posted by Kelley (kelleynnn@gmail.com).
> > Reposting the patch with all review comments addressed and with minor
> > modifications and optimizations. Tests were rerun and commit message
> > updated with new results.
> >
> > The function try_to_unuse() is of quadratic complexity, with a lot of
> > wasted effort. It unuses swap entries one by one, potentially iterating
> > over all the page tables for all the processes in the system for each
> > one.
> >
> > The proposed new implementation of try_to_unuse() reduces its
> > complexity to linear. It iterates over the system's mms once, unusing
> > all the affected entries as it walks each set of page tables. The
> > patch makes similar changes to shmem_unuse().
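> >
> > Roughly, in pseudocode, the loop structure changes from
> >
> >     for each swap entry on the device:      /* O(entries) */
> >         for each mm on the mmlist:          /*  x O(mms)  */
> >             unuse this entry in mm
> >
> > to
> >
> >     for each mm on the mmlist:              /* O(mms) */
> >         unuse every affected entry while walking
> >         this mm's page tables once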
>
> Hi Vineeth, please fold in the fixes below before reposting your
> "mm,swap: rid swapoff of quadratic complexity" patch -
> or ask for more detail if unclear. I could split it up,
> of course, but since they should all (except perhaps one)
> just be merged into the base patch before going any further,
> it'll save me time to keep them together here and just explain:-
>
> shmem_unuse_swap_entries():
> If a user fault races with swapoff, it's very normal for
> shmem_swapin_page() to return -EEXIST, and the old code was
> careful not to pass back any error but -ENOMEM; whereas on mmotm,
> /usr/sbin/swapoff often failed silently because it got that EEXIST.
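>
> Condensed, the shape wanted there is (a sketch, arguments elided):
>
>     error = shmem_swapin_page(...);
>     ...
>     if (error == -ENOMEM)
>         break;      /* the only error worth passing back */
>     error = 0;      /* -EEXIST from a racing fault is normal */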
>
> shmem_unuse():
> A couple of crashing bugs there: a list_del_init without holding the
> mutex, and too much faith in the "safe" in list_for_each_entry_safe():
> it does assume that the mutex has been held throughout, you (very
> nicely!) drop it, but that does require "next" to be re-evaluated.
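>
> Condensed, the safe shape (matching the shmem_unuse() hunk below):
>
>     list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
>         ...
>         mutex_unlock(&shmem_swaplist_mutex);
>         error = shmem_unuse_inode(inode, type);
>         cond_resched();
>         mutex_lock(&shmem_swaplist_mutex);
>         next = list_next_entry(info, swaplist); /* re-evaluate "next" */
>         if (!info->swapped)
>             list_del_init(&info->swaplist);     /* now under the mutex */
>         if (error)
>             break;
>     }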
>
> shmem_writepage():
> Not a bug fix, this is the "except perhaps one": minor optimization,
> could be left out, but if shmem_unuse() is going through the list
> in the forward direction, and may completely unswap a file and delete
> it from the list, then pages from that file can be swapped out to
> *other* swap areas after that, and it can be reinserted in the list:
> better to reinsert it behind shmem_unuse()'s cursor than in front
> of it, which would entail a second pointless pass over that file.
>
> try_to_unuse():
> Moved up the assignment of "oldi = i" (and changed the test to
> "i <= oldi"), so as not to get trapped in that find_next_to_unuse()
> loop when find_get_page() does not find it.
>
> try_to_unuse():
> But the main problem was passing entry.val to find_get_page() there:
> that used to be correct, but since f6ab1f7f6b2d we need to pass just
> the offset - as it stood, it could only find the pages when swapping
> off area 0 (a similar issue was fixed in shmem_replace_page() recently).
> That (together with the late oldi assignment) was why my swapoffs were
> hanging on SWAP_HAS_CACHE swap_map entries.
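>
> That is, the lookup wants just the offset as the page index (a sketch,
> matching the hunk below; note i == swp_offset(entry) at that point):
>
>     entry = swp_entry(type, i);
>     page = find_get_page(swap_address_space(entry), i);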
>
> With those changes, it all seems to work rather well, and is a nice
> simplification of the source, in addition to removing the quadratic
> complexity. To my great surprise, the KSM pages are already handled
> fairly well - the ksm_might_need_to_copy() that has long been in
> unuse_pte() turns out to do (almost) a good enough job already,
> so most users of KSM and swapoff would never see any problem.
> And I'd been afraid of swapin readahead causing spurious -ENOMEMs,
> but have seen nothing of that in practice (though something else
> in mmotm does appear to use up more memory than before).
>
> My remaining criticisms would be:
>
> As Huang Ying pointed out in other mail, there is a danger of
> livelock (or rather, hitting the MAX_RETRIES limit) when a multiply
> mapped page (most especially a KSM page, whose mappings are not
> likely to be nearby in the mmlist) is swapped out, then partially
> swapped off, then some ptes swapped back out. As indeed the
> "Under global memory pressure" comment admits.
>
> I have hit the MAX_RETRIES 3 limit several times in load testing,
> not investigated but I presume due to such a multiply mapped page,
> so at present we do have a regression there. A very simple answer
> would be to remove the retries limiting - perhaps you just added
> that to get around the find_get_page() failure before it was
> understood? That does then tend towards the livelock alternative,
> but you've kept the signal_pending() check, so there's still the
> same way out as the old technique had (but greater likelihood of
> needing it with the new technique). The right fix will be to do
> an rmap walk to unuse all the swap entries of a single anon_vma
> while holding page lock (with KSM needing that page force-deleted
> from swap cache before moving on); but none of us have written
> that code yet; maybe just removing the retries limit is good enough.
>
> Two dislikes on the code structure, probably one solution: the
> "goto retry", up two levels from inside the lower loop, is easy to
> misunderstand; and the oldi business is ugly - find_next_to_unuse()
> was written to wrap around continuously to suit the old loop, but
> now it's left with its "++i >= max" code to achieve that, then your
> "i <= oldi" code to detect when it did, to undo that again: please
> delete code from both ends to make that all simpler.
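>
> For illustration only (untested, and with the frontswap handling
> elided), a find_next_to_unuse() that never wraps could look like:
>
>     static unsigned int find_next_to_unuse(struct swap_info_struct *si,
>                                            unsigned int i)
>     {
>         /* Return 0 at the top of the map; let the caller decide
>          * whether to start over, instead of wrapping in here. */
>         for (i++; i < si->max; i++) {
>             unsigned char count = READ_ONCE(si->swap_map[i]);
>
>             if (count && swap_count(count) != SWAP_MAP_BAD)
>                 return i;
>         }
>         return 0;
>     }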
>
> I'd expect to see checks on inuse_pages in some places, to terminate
> the scans as soon as possible (swapoff of an unused swapfile should
> be very quick, shouldn't it? not requiring any scans at all); but it
> looks like the old code did not have those either - was inuse_pages
> unreliable once upon a time? is it unreliable now?
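>
> Something as cheap as this (just a sketch) before and inside each scan
> would do it:
>
>     if (!READ_ONCE(si->inuse_pages))
>         goto out;   /* nothing left in use: no (more) scanning needed */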
>
> Hugh
>
> ---
>
> mm/shmem.c | 12 ++++++++----
> mm/swapfile.c | 8 ++++----
> 2 files changed, 12 insertions(+), 8 deletions(-)
>
> --- mmotm/mm/shmem.c 2018-12-22 13:32:51.339584848 -0800
> +++ linux/mm/shmem.c 2018-12-31 12:30:55.822407154 -0800
> @@ -1149,6 +1149,7 @@ static int shmem_unuse_swap_entries(stru
> }
> if (error == -ENOMEM)
> break;
> + error = 0;
> }
> return error;
> }
> @@ -1216,12 +1217,15 @@ int shmem_unuse(unsigned int type)
> mutex_unlock(&shmem_swaplist_mutex);
> if (prev_inode)
> iput(prev_inode);
> + prev_inode = inode;
> +
> error = shmem_unuse_inode(inode, type);
> - if (!info->swapped)
> - list_del_init(&info->swaplist);
> cond_resched();
> - prev_inode = inode;
> +
> mutex_lock(&shmem_swaplist_mutex);
> + next = list_next_entry(info, swaplist);
> + if (!info->swapped)
> + list_del_init(&info->swaplist);
> if (error)
> break;
> }
> @@ -1313,7 +1317,7 @@ static int shmem_writepage(struct page *
> */
> mutex_lock(&shmem_swaplist_mutex);
> if (list_empty(&info->swaplist))
> - list_add_tail(&info->swaplist, &shmem_swaplist);
> + list_add(&info->swaplist, &shmem_swaplist);
>
> if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
> spin_lock_irq(&info->lock);
> diff -purN mmotm/mm/swapfile.c linux/mm/swapfile.c
> --- mmotm/mm/swapfile.c 2018-12-22 13:32:51.347584880 -0800
> +++ linux/mm/swapfile.c 2018-12-31 12:30:55.822407154 -0800
> @@ -2156,7 +2156,7 @@ retry:
>
> while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
> /*
> - * under global memory pressure, swap entries
> + * Under global memory pressure, swap entries
> * can be reinserted back into process space
> * after the mmlist loop above passes over them.
> * This loop will then repeat fruitlessly,
> @@ -2164,7 +2164,7 @@ retry:
> * but doing nothing to actually free up the swap.
> * In this case, go over the mmlist loop again.
> */
> - if (i < oldi) {
> + if (i <= oldi) {
> retries++;
> if (retries > MAX_RETRIES) {
> retval = -EBUSY;
> @@ -2172,8 +2172,9 @@ retry:
> }
> goto retry;
> }
> + oldi = i;
> entry = swp_entry(type, i);
> - page = find_get_page(swap_address_space(entry), entry.val);
> + page = find_get_page(swap_address_space(entry), i);
> if (!page)
> continue;
>
> @@ -2188,7 +2189,6 @@ retry:
> try_to_free_swap(page);
> unlock_page(page);
> put_page(page);
> - oldi = i;
> }
> out:
> return retval;
>