From: Chris Li <chrisl@kernel.org>
Date: Thu, 3 Oct 2024 16:03:17 -0700
Subject: Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails
To: "Huang, Ying"
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, liyangouwen1@oppo.com, mhocko@suse.com, minchan@kernel.org, sj@kernel.org, stable@vger.kernel.org, surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org, yosryahmed@google.com, yuzhao@google.com
References: <87y137nxqs.fsf@yhuang6-desk2.ccr.corp.intel.com> <20241002015754.969-1-21cnbao@gmail.com> <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Oct 2, 2024 at 5:35 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Hi, Barry,
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Barry Song
> >> >> >> >
> >> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> >> > unrelated `do_swap_page()`, this method is preferable to the two
> >> >> >> > alternatives: rapid re-entry into page faults, which can cause
> >> >> >> > livelocks, and multi-millisecond sleeps, which visibly degrade the
> >> >> >> > user experience.
> >> >> >>
> >> >> >> In general, I think that this works. Why not extend the solution to
> >> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> >> too? We can call wake_up() when we clear SWAP_HAS_CACHE. To avoid
> >> >> >
> >> >> > Hi Ying,
> >> >> > Thanks for your comments.
> >> >> > I feel extending the solution to __read_swap_cache_async() should be
> >> >> > done in a separate patch. On phones, I've never encountered any issues
> >> >> > reported on that path, so it might be better suited for an optimization
> >> >> > rather than a hotfix?
> >> >>
> >> >> Yes. It's fine to do that in another patch as an optimization.
> >> >
> >> > OK. I'll prepare a separate patch for optimizing that path.
> >>
> >> Thanks!
> >>
> >> >>
> >> >> >> overhead to call wake_up() when there's no task waiting, we can use an
> >> >> >> atomic to count waiting tasks.
> >> >> >
> >> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> >> > waitqueue should have a very low cost on its own?
> >> >>
> >> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> >> shared lock. On systems with many CPUs (such as servers), this may cause
> >> >> severe lock contention. Even the cache ping-pong alone may hurt
> >> >> performance significantly.
> >> >
> >> > I understand that cache synchronization was a significant issue before
> >> > qspinlock, but it seems to be less of a concern after its implementation.
> >>
> >> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> >> discussed in the following thread.
> >>
> >> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >>
> >> > However, using a global atomic variable would still trigger cache
> >> > broadcasts, correct?
> >>
> >> We can change the atomic variable to non-zero only when
> >> swapcache_prepare() returns non-zero, and call wake_up() only when the
> >> atomic variable is non-zero. Because swapcache_prepare() returns 0 most
> >> of the time, the atomic variable is 0 most of the time. If we don't
> >> change the value of the atomic variable, cache ping-pong will not be
> >> triggered.
> >
> > Yes, this can be implemented by adding another atomic variable.
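
For what it's worth, a minimal sketch of that waiter-count idea might look
like the following; the names are made up for illustration and this is not
from any posted patch:

	static atomic_t swapcache_waiters = ATOMIC_INIT(0);
	static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

	/* Waiter side, when swapcache_prepare() fails: */
	DECLARE_WAITQUEUE(wait, current);

	atomic_inc(&swapcache_waiters);
	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);
	atomic_dec(&swapcache_waiters);

	/*
	 * Waker side, after swapcache_clear(): the counter is 0 almost
	 * always, so the read stays cache-shared and the wake_up() slow
	 * path (and its global lock) is skipped entirely.
	 */
	if (atomic_read(&swapcache_waiters))
		wake_up(&swapcache_wq);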
>
> Just realized that we don't need another atomic variable for this; just
> using waitqueue_active() before wake_up() should be enough.
>
> >>
> >> Hi, Kairui,
> >>
> >> Do you have some test cases to test parallel zram swap-in? If so, that
> >> can be used to verify whether cache ping-pong is an issue and whether it
> >> can be fixed via a global atomic variable.
> >>
> >
> > Yes. Kairui, please run a test on your machine with lots of cores before
> > and after adding a global atomic variable as suggested by Ying. I am
> > sorry I don't have a server machine.
> >
> > If it turns out that cache ping-pong is an issue, another approach would
> > be a waitqueue hash:
>
> Yes. A waitqueue hash may help reduce lock contention. And we can have
> both waitqueue_active() and a waitqueue hash if necessary. As the first
> step, waitqueue_active() appears simpler.

Interesting. I just took a look at waitqueue_active(): it requires an
smp_mb() when used without holding the waitqueue lock. Quoting the comment
on waitqueue_active() in include/linux/wait.h:

 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
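
For illustration, the waker side would then look roughly like the sketch
below. This is only the pattern that comment describes, reusing the
swapcache_wq pointer from Barry's patch quoted underneath; it is not the
posted patch itself:

	/* Publish the condition before the lockless emptiness check. */
	swapcache_clear(si, entry, nr_pages);
	/*
	 * Pairs with the barrier implied by set_current_state() on the
	 * waiter side; without it, waitqueue_active() could be reordered
	 * before the store that clears the swap cache flag, and a waiter
	 * could sleep through the wakeup.
	 */
	smp_mb();
	if (waitqueue_active(swapcache_wq))
		wake_up(swapcache_wq);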
Chris

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..aae0e532d8b6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +/*
> > + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> > + * when multiple do_swap_page() operations occur simultaneously.
> > + */
> > +#define SWAPCACHE_WAIT_TABLE_BITS 5
> > +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> > +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> > +
> > +static int __init swapcache_wqs_init(void)
> > +{
> > +	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> > +		init_waitqueue_head(&swapcache_wqs[i]);
> > +
> > +	return 0;
> > +}
> > +late_initcall(swapcache_wqs_init);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	wait_queue_head_t *swapcache_wq;
> >  	struct page *page;
> >  	struct swap_info_struct *si = NULL;
> >  	rmap_t rmap_flags = RMAP_NONE;
> > @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			 * undetectable as pte_same() returns true due
> >  			 * to entry reuse.
> >  			 */
> > +			swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> > +						SWAPCACHE_WAIT_TABLE_BITS)];
> >  			if (swapcache_prepare(entry, nr_pages)) {
> >  				/*
> >  				 * Relax a bit to prevent rapid
> >  				 * repeated page faults.
> >  				 */
> > +				add_wait_queue(swapcache_wq, &wait);
> >  				schedule_timeout_uninterruptible(1);
> > +				remove_wait_queue(swapcache_wq, &wait);
> >  				goto out_page;
> >  			}
> >  			need_clear_cache = true;
> > @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
> > @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
> >  	}
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;

> --
> Best Regards,
> Huang, Ying