From: Barry Song <21cnbao@gmail.com>
To: ying.huang@intel.com, akpm@linux-foundation.org
Cc: 21cnbao@gmail.com, chrisl@kernel.org, david@redhat.com,
    hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com,
    kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    liyangouwen1@oppo.com, mhocko@suse.com, minchan@kernel.org,
    sj@kernel.org, stable@vger.kernel.org, surenb@google.com,
    v-songbaohua@oppo.com, willy@infradead.org, yosryahmed@google.com,
    yuzhao@google.com
Subject: Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails
Date: Tue, 8 Oct 2024 21:08:07 +0800
Message-Id: <20241008130807.40833-1-21cnbao@gmail.com>
In-Reply-To: <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Oct 3, 2024 at 8:35 AM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Hi, Barry,
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Barry Song
> >> >> >> >
> >> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> >> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> >> >> >> > Android devices. To address this, we can use a waitqueue to wake up
> >> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> >> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> >> >> >> > unrelated `do_swap_page()`, this is preferable to the two alternative
> >> >> >> > scenarios: rapid re-entry into page faults, which can cause livelocks,
> >> >> >> > and multiple-millisecond sleeps, which visibly degrade user experience.
> >> >> >>
> >> >> >> In general, I think that this works.  Why not extend the solution to
> >> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> >> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
> >> >> >
> >> >> > Hi Ying,
> >> >> > Thanks for your comments.
> >> >> > I feel extending the solution to __read_swap_cache_async() should be done
> >> >> > in a separate patch.
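To make the tradeoff concrete, the plan boils down to the pattern
sketched below. This is a simplified sketch rather than the exact
mm/memory.c code, and the two helper names are invented for
illustration; swapcache_clear() is the mm-internal helper the patch
already calls:

#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

/* Raced loser: park briefly instead of retrying the fault immediately. */
static void swapcache_prepare_failed_wait(void)
{
	DECLARE_WAITQUEUE(wait, current);

	/* add_wait_queue() takes the waitqueue lock; it makes us visible. */
	add_wait_queue(&swapcache_wq, &wait);
	/* Sleep at most one tick; a wake_up() from the winner ends it early. */
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(&swapcache_wq, &wait);
}

/* Winner: drop the SWAP_HAS_CACHE pin, then wake any parked losers. */
static void swapcache_prepare_done(struct swap_info_struct *si,
				   swp_entry_t entry, int nr_pages)
{
	swapcache_clear(si, entry, nr_pages);
	/*
	 * waitqueue_active() is a lockless check: with no waiters (the
	 * common case) it skips wake_up() and therefore the
	 * spin_lock_irqsave() on the global waitqueue lock.
	 */
	if (waitqueue_active(&swapcache_wq))
		wake_up(&swapcache_wq);
}

The key point is that the waiter's sleep is bounded, so a wakeup missed
by the lockless waitqueue_active() check costs at most one extra tick.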
> >> >> > On phones, I've never encountered any issues reported
> >> >> > on that path, so it might be better suited to an optimization rather than a
> >> >> > hotfix?
> >> >>
> >> >> Yes.  It's fine to do that in another patch as an optimization.
> >> >
> >> > Ok. I'll prepare a separate patch for optimizing that path.
> >>
> >> Thanks!
> >>
> >> >>
> >> >> >> overhead to call wake_up() when there's no task waiting, we can use an
> >> >> >> atomic to count waiting tasks.
> >> >> >
> >> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> >> >> > waitqueue should have a very low cost on its own?
> >> >>
> >> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> >> >> shared lock.  On systems with many CPUs (such as servers), this may cause
> >> >> severe lock contention.  Even the cache ping-pong alone may hurt
> >> >> performance noticeably.
> >> >
> >> > I understand that cache synchronization was a significant issue before
> >> > qspinlock, but it seems to be less of a concern after its implementation.
> >>
> >> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> >> discussed in the following thread.
> >>
> >> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
> >>
> >> > However, using a global atomic variable would still trigger cache broadcasts,
> >> > correct?
> >>
> >> We can change the atomic variable to non-zero only when
> >> swapcache_prepare() returns non-zero, and call wake_up() only when the
> >> atomic variable is non-zero.  Because swapcache_prepare() returns 0 most
> >> of the time, the atomic variable is 0 most of the time.  If we don't
> >> change the value of the atomic variable, cache ping-pong will not be
> >> triggered.
> >
> > Yes, this can be implemented by adding another atomic variable.
>
> Just realized that we don't need another atomic variable for this; just
> using waitqueue_active() before wake_up() should be enough.
>
> >>
> >> Hi, Kairui,
> >>
> >> Do you have some test cases for parallel zram swap-in?  If so, they
> >> can be used to verify whether cache ping-pong is an issue and whether it
> >> can be fixed via a global atomic variable.
> >>
> >
> > Yes. Kairui, please run a test on your machine with lots of cores before
> > and after adding a global atomic variable as suggested by Ying. I am
> > sorry I don't have a server machine.
> >
> > If it turns out that cache ping-pong is an issue, another approach
> > would be a waitqueue hash:
>
> Yes.  A waitqueue hash may help reduce lock contention.  And we can have
> both waitqueue_active() and the waitqueue hash if necessary.  As the
> first step, waitqueue_active() appears simpler.

Hi Andrew,

If there are no objections, can you please squash the change below?
Oven has already tested it, and the original issue is still fixed with
it. If you would prefer that I send a v2 instead, please let me know.

From a5ca401da89f3b628c3a0147e54541d0968654b2 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Tue, 8 Oct 2024 20:18:27 +0800
Subject: [PATCH] mm: wake_up only when swapcache_wq waitqueue is active

wake_up() will acquire the spinlock even when the waitqueue is empty.
This might involve cache sync overhead. Let's only call wake_up() when
the waitqueue is active.

Suggested-by: "Huang, Ying"
Signed-off-by: Barry Song
---
 mm/memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fe21bd3beff5..4adb2d0bcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4623,7 +4623,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);
@@ -4641,7 +4642,8 @@
 	}
 	if (need_clear_cache) {
 		swapcache_clear(si, entry, nr_pages);
-		wake_up(&swapcache_wq);
+		if (waitqueue_active(&swapcache_wq))
+			wake_up(&swapcache_wq);
 	}
 	if (si)
 		put_swap_device(si);
--
2.39.3 (Apple Git-146)
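For reference, the earlier waitqueue-hash draft (quoted below) avoids a
single global lock by spreading waiters across a small table of
waitqueues. Its bucket selection reduces to roughly the following;
the constants and the PMD_MASK rounding come from the draft, while the
helper name is mine:

#include <linux/hash.h>
#include <linux/pgtable.h>	/* PMD_MASK */
#include <linux/wait.h>

#define SWAPCACHE_WAIT_TABLE_BITS 5
#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];

/*
 * Bucket lookup: all faults on the same PMD-sized region hash to the
 * same waitqueue, so a winner's wake_up() reaches the waiters actually
 * racing with it, while unrelated faults contend on other buckets'
 * locks.
 */
static wait_queue_head_t *swapcache_wq_for(unsigned long address)
{
	return &swapcache_wqs[hash_long(address & PMD_MASK,
					SWAPCACHE_WAIT_TABLE_BITS)];
}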
>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..aae0e532d8b6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +/*
> > + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> > + * when multiple do_swap_page() operations occur simultaneously.
> > + */
> > +#define SWAPCACHE_WAIT_TABLE_BITS 5
> > +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> > +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> > +
> > +static int __init swapcache_wqs_init(void)
> > +{
> > +	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> > +		init_waitqueue_head(&swapcache_wqs[i]);
> > +
> > +	return 0;
> > +}
> > +late_initcall(swapcache_wqs_init);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	wait_queue_head_t *swapcache_wq;
> >  	struct page *page;
> >  	struct swap_info_struct *si = NULL;
> >  	rmap_t rmap_flags = RMAP_NONE;
> > @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  				 * undetectable as pte_same() returns true due
> >  				 * to entry reuse.
> >  				 */
> > +				swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> > +							SWAPCACHE_WAIT_TABLE_BITS)];
> >  				if (swapcache_prepare(entry, nr_pages)) {
> >  					/*
> >  					 * Relax a bit to prevent rapid
> >  					 * repeated page faults.
> >  					 */
> > +					add_wait_queue(swapcache_wq, &wait);
> >  					schedule_timeout_uninterruptible(1);
> > +					remove_wait_queue(swapcache_wq, &wait);
> >  					goto out_page;
> >  				}
> >  				need_clear_cache = true;
> > @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
> > @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
> >  	}
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
>
> --
> Best Regards,
> Huang, Ying
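As an aside, the atomic-counter variant discussed above would have
looked roughly like this; it was set aside because waitqueue_active()
gives the same common-case behaviour without the extra variable. A
hypothetical sketch only, with invented helper names, reusing the
swapcache_wq style of waitqueue:

#include <linux/atomic.h>
#include <linux/sched.h>
#include <linux/wait.h>

static atomic_t swapcache_waiters = ATOMIC_INIT(0);

/* Waiter side: advertise ourselves before the bounded sleep. */
static void swapcache_wait_counted(wait_queue_head_t *wq)
{
	DECLARE_WAITQUEUE(wait, current);

	atomic_inc(&swapcache_waiters);
	add_wait_queue(wq, &wait);
	schedule_timeout_uninterruptible(1);
	remove_wait_queue(wq, &wait);
	atomic_dec(&swapcache_waiters);
}

/*
 * Waker side: the counter stays 0 on the common path, so CPUs only
 * read a clean shared cacheline; it is written only after a
 * swapcache_prepare() failure, avoiding routine cache ping-pong.
 */
static void swapcache_wake_counted(wait_queue_head_t *wq)
{
	if (atomic_read(&swapcache_waiters))
		wake_up(wq);
}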