From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, chrisl@kernel.org, david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, liyangouwen1@oppo.com, mhocko@suse.com, minchan@kernel.org, sj@kernel.org, stable@vger.kernel.org, surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org, yosryahmed@google.com, yuzhao@google.com
Subject: Re: [PATCH] mm: avoid unconditional one-tick sleep when swapcache_prepare fails
In-Reply-To: <20241002015754.969-1-21cnbao@gmail.com> (Barry Song's message of "Wed, 2 Oct 2024 09:57:54 +0800")
References: <87y137nxqs.fsf@yhuang6-desk2.ccr.corp.intel.com> <20241002015754.969-1-21cnbao@gmail.com>
Date: Thu, 03 Oct 2024 08:31:40 +0800
Message-ID: <87ikuani1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying wrote:
>> >> >>
>> >> >> Hi, Barry,
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > From: Barry Song
>> >> >> >
>> >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> >> >> > Android devices. To address this, we can use a waitqueue to wake up
>> >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> >> >> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
>> >> >> > rapid re-entry into page faults, which can cause livelocks, and
>> >> >> > multiple millisecond sleeps, which visibly degrade user experience.
>> >> >>
>> >> >> In general, I think that this works.  Why not extend the solution to
>> >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> >> >> too?  We can call wake_up() when we clear SWAP_HAS_CACHE.  To avoid
>> >> >
>> >> > Hi Ying,
>> >> > Thanks for your comments.
>> >> > I feel extending the solution to __read_swap_cache_async() should be done
>> >> > in a separate patch. On phones, I've never encountered any issues reported
>> >> > on that path, so it might be better suited for an optimization rather than
>> >> > a hotfix?
>> >>
>> >> Yes.  It's fine to do that in another patch as optimization.
>> >
>> > Ok. I'll prepare a separate patch for optimizing that path.
>>
>> Thanks!
>>
>> >>
>> >> >> overhead to call wake_up() when there's no task waiting, we can use an
>> >> >> atomic to count waiting tasks.
>> >> >
>> >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> >> > waitqueue should have a very low cost on its own?
>> >>
>> >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> >> shared lock.  On systems with many CPUs (such as servers), this may cause
>> >> severe lock contention.
>> >> Even the cache ping-pong may hurt performance
>> >> much.
>> >
>> > I understand that cache synchronization was a significant issue before
>> > qspinlock, but it seems to be less of a concern after its implementation.
>>
>> Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
>> discussed in the following thread.
>>
>> https://lore.kernel.org/lkml/20220510192708.GQ76023@worktop.programming.kicks-ass.net/
>>
>> > However, using a global atomic variable would still trigger cache broadcasts,
>> > correct?
>>
>> We can only change the atomic variable to non-zero when
>> swapcache_prepare() returns non-zero, and call wake_up() when the atomic
>> variable is non-zero.  Because swapcache_prepare() returns 0 most times,
>> the atomic variable is 0 most times.  If we don't change the value of the
>> atomic variable, cache ping-pong will not be triggered.
>
> yes. this can be implemented by adding another atomic variable.

Just realized that we don't need another atomic variable for this; just
using waitqueue_active() before wake_up() should be enough.

>> Hi, Kairui,
>>
>> Do you have some test cases to test parallel zram swap-in?  If so, that
>> can be used to verify whether cache ping-pong is an issue and whether it
>> can be fixed via a global atomic variable.
>>
> Yes, Kairui please run a test on your machine with lots of cores before
> and after adding a global atomic variable as suggested by Ying. I am
> sorry I don't have a server machine.
>
> if it turns out you find cache ping-pong can be an issue, another
> approach would be a waitqueue hash:

Yes.  A waitqueue hash may help reduce lock contention.  And, we can have
both waitqueue_active() and the waitqueue hash if necessary.  As the first
step, waitqueue_active() appears simpler.
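For what it's worth, the waitqueue_active() idea can be sketched in
userspace with pthreads (illustrative only, not kernel code; the names
waitq/nr_waiters are made up here, with a condvar plus an explicit waiter
count standing in for the kernel waitqueue):

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Userspace sketch of the waitqueue_active() pattern: a condition
 * variable plus a waiter count that plays the role of the kernel's
 * waitqueue_active() check.  All names are hypothetical.
 */
struct waitq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int nr_waiters;		/* nonzero iff somebody is sleeping */
};

#define WAITQ_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Cheap, racy, lockless check, like waitqueue_active() in the kernel. */
static bool waitq_active(struct waitq *wq)
{
	return __atomic_load_n(&wq->nr_waiters, __ATOMIC_RELAXED) != 0;
}

/*
 * Wake-up path: take the lock (the expensive part on many-CPU boxes)
 * only when the cheap check says somebody may be waiting, mirroring
 * "if (waitqueue_active(wq)) wake_up(wq);".
 */
static void waitq_wake_all(struct waitq *wq)
{
	if (!waitq_active(wq))
		return;		/* common case: no waiter, no lock taken */
	pthread_mutex_lock(&wq->lock);
	pthread_cond_broadcast(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}
```

The lockless check is only safe because a waiter that races with it
re-checks its condition under the lock; in do_swap_page() the one-tick
timeout bounds a missed wake-up anyway.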
> diff --git a/mm/memory.c b/mm/memory.c
> index 2366578015ad..aae0e532d8b6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4192,6 +4192,23 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +/*
> + * Alleviating the 'thundering herd' phenomenon using a waitqueue hash
> + * when multiple do_swap_page() operations occur simultaneously.
> + */
> +#define SWAPCACHE_WAIT_TABLE_BITS 5
> +#define SWAPCACHE_WAIT_TABLE_SIZE (1 << SWAPCACHE_WAIT_TABLE_BITS)
> +static wait_queue_head_t swapcache_wqs[SWAPCACHE_WAIT_TABLE_SIZE];
> +
> +static int __init swapcache_wqs_init(void)
> +{
> +	for (int i = 0; i < SWAPCACHE_WAIT_TABLE_SIZE; i++)
> +		init_waitqueue_head(&swapcache_wqs[i]);
> +
> +	return 0;
> +}
> +late_initcall(swapcache_wqs_init);
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4204,6 +4221,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *swapcache, *folio = NULL;
> +	DECLARE_WAITQUEUE(wait, current);
> +	wait_queue_head_t *swapcache_wq;
>  	struct page *page;
>  	struct swap_info_struct *si = NULL;
>  	rmap_t rmap_flags = RMAP_NONE;
> @@ -4297,12 +4316,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			 * undetectable as pte_same() returns true due
>  			 * to entry reuse.
>  			 */
> +			swapcache_wq = &swapcache_wqs[hash_long(vmf->address & PMD_MASK,
> +						SWAPCACHE_WAIT_TABLE_BITS)];
>  			if (swapcache_prepare(entry, nr_pages)) {
>  				/*
>  				 * Relax a bit to prevent rapid
>  				 * repeated page faults.
>  				 */
> +				add_wait_queue(swapcache_wq, &wait);
>  				schedule_timeout_uninterruptible(1);
> +				remove_wait_queue(swapcache_wq, &wait);
>  				goto out_page;
>  			}
>  			need_clear_cache = true;
> @@ -4609,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> -	if (need_clear_cache)
> +	if (need_clear_cache) {
>  		swapcache_clear(si, entry, nr_pages);
> +		wake_up(swapcache_wq);
> +	}
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4625,8 +4650,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_unlock(swapcache);
>  		folio_put(swapcache);
>  	}
> -	if (need_clear_cache)
> +	if (need_clear_cache) {
>  		swapcache_clear(si, entry, nr_pages);
> +		wake_up(swapcache_wq);
> +	}
>  	if (si)
>  		put_swap_device(si);
>  	return ret;

--
Best Regards,
Huang, Ying