From: Chris Li <chrisl@kernel.org>
Date: Thu, 3 Oct 2024 16:03:17 -0700
Subject: Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails
To: "Huang, Ying"
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, liyangouwen1@oppo.com, mhocko@suse.com, minchan@kernel.org, sj@kernel.org, stable@vger.kernel.org, surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org, yosryahmed@google.com, yuzhao@google.com
References: <87y137nxqs.fsf@yhuang6-desk2.ccr.corp.intel.com> <20241002015754.969-1-21cnbao@gmail.com> <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Oct 2, 2024 at 5:35 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Hi, Barry,
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Barry Song
> >> >> >> >
> >> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> >> > unrelated `do_swap_page()`, this method is preferable to the two
> >> >> >> > alternatives: rapid re-entry into page faults, which can cause
> >> >> >> > livelocks, and multi-millisecond sleeps, which visibly degrade the
> >> >> >> > user experience.
> >> >> >>
> >> >> >> In general, I think that this works. Why not extend the solution to
> >> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> >> too? We can call wake_up() when we clear SWAP_HAS_CACHE. To avoid
> >> >> >
> >> >> > Hi Ying,
> >> >> > Thanks for your comments.
> >> >> > I feel extending the solution to __read_swap_cache_async() should be
> >> >> > done in a separate patch. On phones, I've never encountered any issues
> >> >> > reported on that path, so it might be better suited for an optimization
> >> >> > rather than a hotfix?
> >> >>
> >> >> Yes. It's fine to do that in another patch as an optimization.
> >> >
> >> > OK. I'll prepare a separate patch for optimizing that path.
> >>
> >> Thanks!
> >>
> >> >>
> >> >> >> overhead to call wake_up() when there's no task waiting, we can use an
> >> >> >> atomic to count waiting tasks.
> >> >> >
> >> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> >> > waitqueue should have a very low cost on its own?
> >> >>
> >> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> >> shared lock. On systems with many CPUs (such as servers), this may cause
> >> >> severe lock contention. Even the cache ping-pong alone may hurt
> >> >> performance significantly.
> >> >
> >> > I understand that cache synchronization was a significant issue before
> >> > qspinlock, but it seems to be less of a concern after its implementation.
> >>
> >> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> >> discussed in the following thread.
> >>
> >> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >>
> >> > However, using a global atomic variable would still trigger cache
> >> > broadcasts, correct?
> >>
> >> We can change the atomic variable to non-zero only when
> >> swapcache_prepare() returns non-zero, and call wake_up() only when the
> >> atomic variable is non-zero. Because swapcache_prepare() returns 0 most
> >> of the time, the atomic variable is 0 most of the time. If we don't
> >> change the value of the atomic variable, cache ping-pong will not be
> >> triggered.
> >
> > Yes, this can be implemented by adding another atomic variable.
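
For what it's worth, a minimal sketch of that waiter-count idea might look
like the following; the names are made up for illustration and this is not
from any posted patch:

	static atomic_t swapcache_waiters = ATOMIC_INIT(0);
	static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

	/* Waiter side, when swapcache_prepare() fails: */
	DECLARE_WAITQUEUE(wait, current);

	atomic_inc(&swapcache_waiters);
	add_wait_queue(&swapcache_wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);
	atomic_dec(&swapcache_waiters);

	/*
	 * Waker side, after swapcache_clear(): the counter is 0 almost
	 * always, so the read stays cache-shared and the wake_up() slow
	 * path (and its global lock) is skipped entirely.
	 */
	if (atomic_read(&swapcache_waiters))
		wake_up(&swapcache_wq);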
>
> Just realized that we don't need another atomic variable for this; just
> using waitqueue_active() before wake_up() should be enough.
>
> >>
> >> Hi, Kairui,
> >>
> >> Do you have some test cases to test parallel zram swap-in? If so, that
> >> can be used to verify whether cache ping-pong is an issue and whether it
> >> can be fixed via a global atomic variable.
> >>
> >
> > Yes. Kairui, please run a test on your machine with lots of cores before
> > and after adding a global atomic variable as suggested by Ying. I am
> > sorry I don't have a server machine.
> >
> > If it turns out that cache ping-pong is an issue, another approach would
> > be a waitqueue hash:
>
> Yes. A waitqueue hash may help reduce lock contention. And we can have
> both waitqueue_active() and a waitqueue hash if necessary. As the first
> step, waitqueue_active() appears simpler.

Interesting. I just took a look at waitqueue_active(): it requires an
smp_mb() when used without holding the waitqueue lock. Quoting the comment
on waitqueue_active() in include/linux/wait.h:

 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
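
For illustration, the waker side would then look roughly like the sketch
below. This is only the pattern that comment describes, reusing the
swapcache_wq pointer from Barry's patch quoted underneath; it is not the
posted patch itself:

	/* Publish the condition before the lockless emptiness check. */
	swapcache_clear(si, entry, nr_pages);
	/*
	 * Pairs with the barrier implied by set_current_state() on the
	 * waiter side; without it, waitqueue_active() could be reordered
	 * before the store that clears the swap cache flag, and a waiter
	 * could sleep through the wakeup.
	 */
	smp_mb();
	if (waitqueue_active(swapcache_wq))
		wake_up(swapcache_wq);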
Chris

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..aae0e532d8b6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +/*
> > + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> > + * when multiple do_swap_page() operations occur simultaneously.
> > + */
> > +#define SWAPCACHE_WAIT_TABLE_BITS 5
> > +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> > +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> > +
> > +static int __init swapcache_wqs_init(void)
> > +{
> > +	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> > +		init_waitqueue_head(&swapcache_wqs[i]);
> > +
> > +	return 0;
> > +}
> > +late_initcall(swapcache_wqs_init);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	wait_queue_head_t *swapcache_wq;
> >  	struct page *page;
> >  	struct swap_info_struct *si = NULL;
> >  	rmap_t rmap_flags = RMAP_NONE;
> > @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			 * undetectable as pte_same() returns true due
> >  			 * to entry reuse.
> >  			 */
> > +			swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> > +						SWAPCACHE_WAIT_TABLE_BITS)];
> >  			if (swapcache_prepare(entry, nr_pages)) {
> >  				/*
> >  				 * Relax a bit to prevent rapid
> >  				 * repeated page faults.
> >  				 */
> > +				add_wait_queue(swapcache_wq, &wait);
> >  				schedule_timeout_uninterruptible(1);
> > +				remove_wait_queue(swapcache_wq, &wait);
> >  				goto out_page;
> >  			}
> >  			need_clear_cache = true;
> > @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
> > @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
> >  	}
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;

> --
> Best Regards,
> Huang, Ying