From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vineeth Pillai <vpillai@digitalocean.com>
Date: Tue, 1 Jan 2019 13:24:15 -0500
Subject: Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity
To: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox, Andrew Morton, Huang Ying, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Kelley Nielsen, Rik van Riel
References: <20181203170934.16512-1-vpillai@digitalocean.com>
 <20181203170934.16512-2-vpillai@digitalocean.com>

Thanks a lot for the fixes and the detailed explanation, Hugh! I shall
fold all the changes from you and Huang into the next iteration.

Thanks for all the suggestions and comments as well. I am looking into
them and will include the changes in the next version; I will follow up
by mail if anything needs clarification.

Thanks again!
~Vineeth

On Mon, Dec 31, 2018 at 7:44 PM Hugh Dickins <hughd@google.com> wrote:

> On Mon, 3 Dec 2018, Vineeth Remanan Pillai wrote:
>
> > This patch was initially posted by Kelley (kelleynnn@gmail.com).
> > Reposting the patch with all review comments addressed and with minor
> > modifications and optimizations. Tests were rerun and the commit
> > message updated with the new results.
> >
> > The function try_to_unuse() is of quadratic complexity, with a lot of
> > wasted effort. It unuses swap entries one by one, potentially
> > iterating over all the page tables for all the processes in the
> > system for each one.
> >
> > This new proposed implementation of try_to_unuse() simplifies its
> > complexity to linear. It iterates over the system's mms once, unusing
> > all the affected entries as it walks each set of page tables. It also
> > makes similar changes to shmem_unuse().
>
> Hi Vineeth, please fold in the fixes below before reposting your
> "mm,swap: rid swapoff of quadratic complexity" patch -
> or ask for more detail if anything is unclear. I could split it up,
> of course, but since they should all (except perhaps one)
> just be merged into the base patch before going any further,
> it'll save me time to keep them together here and just explain:-
>
> shmem_unuse_swap_entries():
> If a user fault races with swapoff, it's very normal for
> shmem_swapin_page() to return -EEXIST, and the old code was
> careful not to pass back any error but -ENOMEM; whereas on mmotm,
> /usr/sbin/swapoff often failed silently because it got that EEXIST.
>
> shmem_unuse():
> A couple of crashing bugs there: a list_del_init without holding the
> mutex, and too much faith in the "safe" in list_for_each_entry_safe():
> it assumes that the mutex has been held throughout; you (very
> nicely!) drop it, but that then requires "next" to be re-evaluated.
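
To make the race concrete, the fixed shmem_unuse() walk has roughly this
shape (a condensed sketch of the hunks in the diff below, not the exact
mmotm code; setup and error handling elided):

	mutex_lock(&shmem_swaplist_mutex);
	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
		...
		mutex_unlock(&shmem_swaplist_mutex);

		/* the list may be modified while the mutex is dropped */
		error = shmem_unuse_inode(inode, type);
		cond_resched();

		mutex_lock(&shmem_swaplist_mutex);
		/*
		 * The "next" cached by list_for_each_entry_safe() may be
		 * stale by now: re-read it under the mutex, and only then
		 * do the deletion, which is safe again.
		 */
		next = list_next_entry(info, swaplist);
		if (!info->swapped)
			list_del_init(&info->swaplist);
		if (error)
			break;
	}
	mutex_unlock(&shmem_swaplist_mutex);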
>
> shmem_writepage():
> Not a bug fix, this is the "except perhaps one": a minor optimization,
> which could be left out; but if shmem_unuse() is going through the list
> in the forward direction, and may completely unswap a file and delete
> it from the list, then pages from that file can be swapped out to
> *other* swap areas after that, and it can be reinserted in the list:
> better to reinsert it behind shmem_unuse()'s cursor than in front
> of it, which would entail a second pointless pass over that file.
>
> try_to_unuse():
> Moved up the assignment of "oldi = i" (and changed the test to
> "i <= oldi"), so as not to get trapped in that find_next_to_unuse()
> loop when find_get_page() does not find the page.
>
> try_to_unuse():
> But the main problem was passing entry.val to find_get_page() there:
> that used to be correct, but since f6ab1f7f6b2d we need to pass just
> the offset - as it stood, it could only find the pages when swapping
> off area 0 (a similar issue was fixed in shmem_replace_page() recently).
> That (together with the late oldi assignment) was why my swapoffs were
> hanging on SWAP_HAS_CACHE swap_map entries.
>
> With those changes, it all seems to work rather well, and is a nice
> simplification of the source, in addition to removing the quadratic
> complexity. To my great surprise, the KSM pages are already handled
> fairly well - the ksm_might_need_to_copy() that has long been in
> unuse_pte() turns out to do (almost) a good enough job already,
> so most users of KSM and swapoff would never see any problem.
> And I'd been afraid of swapin readahead causing spurious -ENOMEMs,
> but have seen nothing of that in practice (though something else
> in mmotm does appear to use up more memory than before).
>
> My remaining criticisms would be:
>
> As Huang Ying pointed out in other mail, there is a danger of
> livelock (or rather, of hitting the MAX_RETRIES limit) when a multiply
> mapped page (most especially a KSM page, whose mappings are not
> likely to be nearby in the mmlist) is swapped out, then partially
> swapped off, then some ptes swapped back out. As indeed the
> "Under global memory pressure" comment admits.
>
> I have hit the MAX_RETRIES 3 limit several times in load testing,
> not investigated but I presume due to such a multiply mapped page,
> so at present we do have a regression there. A very simple answer
> would be to remove the retries limiting - perhaps you just added
> that to get around the find_get_page() failure before it was
> understood? That does then tend towards the livelock alternative,
> but you've kept the signal_pending() check, so there's still the
> same way out as the old technique had (but a greater likelihood of
> needing it with the new technique). The right fix will be to do
> an rmap walk to unuse all the swap entries of a single anon_vma
> while holding the page lock (with KSM needing that page force-deleted
> from swap cache before moving on); but none of us have written
> that code yet, so maybe just removing the retries limit is good enough.
>
> Two dislikes on the code structure, probably with one solution: the
> "goto retry", up two levels from inside the lower loop, is easy to
> misunderstand; and the oldi business is ugly - find_next_to_unuse()
> was written to wrap around continuously to suit the old loop, but
> now it's left with its "++i >= max" code to achieve that, then your
> "i <= oldi" code to detect when it did, to undo that again: please
> delete code from both ends to make that all simpler.
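
For reference on the entry.val point: a swp_entry_t packs the swap type
into the high bits and the offset into the low bits, so entry.val equals
the offset exactly when type == 0 - which is why the stale call "worked"
on swap area 0 and hung everywhere else. A sketch of the distinction
(arch details elided; swp_offset(entry) here equals the loop's i):

	/*
	 * Roughly: entry.val == (type << SWP_TYPE_SHIFT) | offset.
	 * Since f6ab1f7f6b2d the swap cache is indexed by the offset
	 * alone, so the lookup must be:
	 */
	page = find_get_page(swap_address_space(entry), swp_offset(entry));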
>
> I'd expect to see checks on inuse_pages in some places, to terminate
> the scans as soon as possible (swapoff of an unused swapfile should
> be very quick, shouldn't it? not requiring any scans at all); but it
> looks like the old code did not have those either - was inuse_pages
> unreliable once upon a time? is it unreliable now?
>
> Hugh
>
> ---
>
>  mm/shmem.c    |   12 ++++++++----
>  mm/swapfile.c |    8 ++++----
>  2 files changed, 12 insertions(+), 8 deletions(-)
>
> --- mmotm/mm/shmem.c    2018-12-22 13:32:51.339584848 -0800
> +++ linux/mm/shmem.c    2018-12-31 12:30:55.822407154 -0800
> @@ -1149,6 +1149,7 @@ static int shmem_unuse_swap_entries(stru
>                 }
>                 if (error == -ENOMEM)
>                         break;
> +               error = 0;
>         }
>         return error;
>  }
> @@ -1216,12 +1217,15 @@ int shmem_unuse(unsigned int type)
>                 mutex_unlock(&shmem_swaplist_mutex);
>                 if (prev_inode)
>                         iput(prev_inode);
> +               prev_inode = inode;
> +
>                 error = shmem_unuse_inode(inode, type);
> -               if (!info->swapped)
> -                       list_del_init(&info->swaplist);
>                 cond_resched();
> -               prev_inode = inode;
> +
>                 mutex_lock(&shmem_swaplist_mutex);
> +               next = list_next_entry(info, swaplist);
> +               if (!info->swapped)
> +                       list_del_init(&info->swaplist);
>                 if (error)
>                         break;
>         }
> @@ -1313,7 +1317,7 @@ static int shmem_writepage(struct page *
>          */
>         mutex_lock(&shmem_swaplist_mutex);
>         if (list_empty(&info->swaplist))
> -               list_add_tail(&info->swaplist, &shmem_swaplist);
> +               list_add(&info->swaplist, &shmem_swaplist);
>
>         if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
>                 spin_lock_irq(&info->lock);
> diff -purN mmotm/mm/swapfile.c linux/mm/swapfile.c
> --- mmotm/mm/swapfile.c 2018-12-22 13:32:51.347584880 -0800
> +++ linux/mm/swapfile.c 2018-12-31 12:30:55.822407154 -0800
> @@ -2156,7 +2156,7 @@ retry:
>
>         while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
>                 /*
> -                * under global memory pressure, swap entries
> +                * Under global memory pressure, swap entries
>                  * can be reinserted back into process space
>                  * after the mmlist loop above passes over them.
>                  * This loop will then repeat fruitlessly,
> @@ -2164,7 +2164,7 @@ retry:
>                  * but doing nothing to actually free up the swap.
>                  * In this case, go over the mmlist loop again.
>                  */
> -               if (i < oldi) {
> +               if (i <= oldi) {
>                         retries++;
>                         if (retries > MAX_RETRIES) {
>                                 retval = -EBUSY;
> @@ -2172,8 +2172,9 @@ retry:
>                         }
>                         goto retry;
>                 }
> +               oldi = i;
>                 entry = swp_entry(type, i);
> -               page = find_get_page(swap_address_space(entry), entry.val);
> +               page = find_get_page(swap_address_space(entry), i);
>                 if (!page)
>                         continue;
>
> @@ -2188,7 +2189,6 @@ retry:
>                 try_to_free_swap(page);
>                 unlock_page(page);
>                 put_page(page);
> -               oldi = i;
>         }
>  out:
>         return retval;
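
A minimal sketch of the inuse_pages early exit asked about above -
hypothetical, not part of the diff, and assuming si->inuse_pages is
accurate enough for an unlocked read:

	while (READ_ONCE(si->inuse_pages) &&
	       (i = find_next_to_unuse(si, i, frontswap)) != 0) {
		...
	}

so that swapoff of an already-unused swap area would terminate without
scanning anything.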