linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosryahmed@google.com>
To: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Nhat Pham <nphamcs@gmail.com>,
	Chengming Zhou <zhouchengming@bytedance.com>,
	 Huang Ying <ying.huang@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
Date: Thu, 25 Jan 2024 14:33:59 -0800	[thread overview]
Message-ID: <CAJD7tkYFX98TZGiiBwXRtiJM_zEphzLq-vNXNyrzun8gTRLuGw@mail.gmail.com> (raw)
In-Reply-To: <CAF8kJuOFbRYZiFmtrAAqh7KBxWNaYtK10e7Ych4VxDKOocRKEQ@mail.gmail.com>

On Thu, Jan 25, 2024 at 2:31 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > Hi Yosry,
> > >
> > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@google.com> wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > >
> > > > > > >
> > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > > > > > >
> > > > > > > The first solution that came to mind for me was refcounting the zswap
> > > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > > > > > the percpu-refcount may be an overkill in terms of memory usage
> > > > > > > though. I think we can still do our own refcounting with RCU, but it
> > > > > > > may be more complicated.
> > > > > >
> > > > > > FWIW, I was able to reproduce the problem in a vm with the following
> > > > > > kernel diff:
> > > > >
> > > > > Thanks for the great find.
> > > > >
> > > > > I was worry about the usage after free situation in this email:
> > > > >
> > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@mail.gmail.com/
> > > > >
> > > > > Glad you are able to find a reproducible case. That is one of the
> > > > > reasons I change the free to invalidate entries in my xarray patch.
> > > > >
> > > > > I think the swap_off code should remove the entry from the tree, just
> > > > > wait for each zswap entry to drop to zero.  Then free it.
> > > >
> > > > This doesn't really help. The swapoff code is already removing all the
> > > > entries from the trees before zswap_swapoff() is called through
> > > > zswap_invalidate(). The race I described occurs because the writeback
> > > > code is accessing the entries through the LRU, not the tree. The
> > >
> > > Why?
> > > Entry not in the tree is fine. What you describe is that, swap_off
> > > code will not see the entry because it is already not in the tree.
> > > The last one holding the reference count would free it.
> > >
> > > > writeback code could have isolated a zswap entry from the LRU before
> > > > swapoff, then tried to access it after swapoff. Although the zswap
> > >
> > > The writeback should have a reference count to the zswap entry it
> > > holds. The entry will not be free until the LRU is done and drop the
> > > reference count to zero.
> > >
> > > > entry itself is referenced and safe to use, accessing the tree to grab
> > > > the tree lock and check if the entry is still in the tree is the
> > > > problem.
> > >
> > > The swap off should wait until all the LRU list from that tree has
> > > been consumed before destroying the tree.
> > > In swap off, it walks all the process MM anyway, walking all the memcg
> > > and finding all the zswap entries in zswap LRU should solve that
> > > problem.
> >
> > At that point, the entry is isolated from the zswap LRU list as well.
> > So even if swap off iterates the zswap LRUs, it cannot find it to wait
>
> It just means that we need to defer removing the entry from LRU, only
> remove it after most of the write back is complete to some critical
> steps.
>
> > for it. Take a closer look at the race condition I described. The
>
> I take a closer look at the sequence Chengming describe, it has the
> element of delay removing entry from the LRU as well.
>
> > problem is that after the entry is isolated from the zswap LRU, we
> > need to grab the tree lock to make sure it's still there and get a
> > ref, and just trying to lock the tree may be a UAF if we race with
> > swapoff.
>
> I feel it is very wrong to have the tree freed while having
> outstanding entry allocationed from the tree pending.
> I would want to avoid that situation if possible.

This should be the case with Chengming's solution.

>
> >
> > > Anyway, I think it is easier to reason about the behavior that way.
> > > Swap off will take the extra hit, but not the normal access of the
> > > tree.
> >
> > I think this adds a lot of unnecessary complexity tbh. I think the
> > operations reordering that Chengming described may be the way to go
> > here. It does not include adding more logic or synchronization
>
> Does not require adding the tree reference count right?

No, just reordering of operations in the writeback path.

>
> > primitives, just reordering operations to be closer to what we do in
> > zswap_load() and leverage existing synchronization.
>
> The complexity is mostly for avoiding the tree reference count. If we
> don't add the tree refcount and we don't need the extra complexity in
> the tree waiting for LRU, that sounds great to me.

I think that's what Chengming's proposal provides.


  reply	other threads:[~2024-01-25 22:34 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-20  2:40 [PATCH 0/2] mm: zswap: simplify zswap_swapoff() Yosry Ahmed
2024-01-20  2:40 ` [PATCH 1/2] mm: swap: update inuse_pages after all cleanups are done Yosry Ahmed
2024-01-22 13:17   ` Chengming Zhou
2024-01-23  8:59   ` Huang, Ying
2024-01-23  9:40     ` Yosry Ahmed
2024-01-23  9:54       ` Yosry Ahmed
2024-01-24  3:13       ` Huang, Ying
2024-01-24  3:20         ` Yosry Ahmed
2024-01-24  3:27           ` Huang, Ying
2024-01-24  4:15             ` Yosry Ahmed
2024-01-20  2:40 ` [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff() Yosry Ahmed
2024-01-22 13:13   ` Chengming Zhou
2024-01-22 20:19   ` Johannes Weiner
2024-01-22 20:39     ` Yosry Ahmed
2024-01-23 15:38       ` Johannes Weiner
2024-01-23 15:54         ` Yosry Ahmed
2024-01-23 20:12           ` Johannes Weiner
2024-01-23 21:02             ` Yosry Ahmed
2024-01-24  6:57               ` Yosry Ahmed
2024-01-25  5:28                 ` Chris Li
2024-01-25  7:59                   ` Yosry Ahmed
2024-01-25 18:55                     ` Chris Li
2024-01-25 20:57                       ` Yosry Ahmed
2024-01-25 22:31                         ` Chris Li
2024-01-25 22:33                           ` Yosry Ahmed [this message]
2024-01-26  1:09                             ` Chris Li
2024-01-24  7:20               ` Chengming Zhou
2024-01-25  5:44                 ` Chris Li
2024-01-25  8:01                   ` Yosry Ahmed
2024-01-25 19:03                     ` Chris Li
2024-01-25 21:01                       ` Yosry Ahmed
2024-01-25  7:53                 ` Yosry Ahmed
2024-01-25  8:03                   ` Yosry Ahmed
2024-01-25  8:30                   ` Chengming Zhou
2024-01-25  8:42                     ` Yosry Ahmed
2024-01-25  8:52                       ` Chengming Zhou
2024-01-25  9:03                         ` Yosry Ahmed
2024-01-25  9:22                           ` Chengming Zhou
2024-01-25  9:26                             ` Yosry Ahmed
2024-01-25  9:38                               ` Chengming Zhou
2024-01-26  0:03                   ` Chengming Zhou
2024-01-26  0:05                     ` Yosry Ahmed
2024-01-26  0:10                       ` Chengming Zhou
2024-01-23 20:30           ` Nhat Pham
2024-01-23 21:04             ` Yosry Ahmed
2024-01-22 21:21   ` Nhat Pham
2024-01-22 22:31   ` Chris Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJD7tkYFX98TZGiiBwXRtiJM_zEphzLq-vNXNyrzun8gTRLuGw@mail.gmail.com \
    --to=yosryahmed@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ying.huang@intel.com \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox